Over the last decade, advancements in the fields of AI and machine learning have been moving at such breakneck speeds it almost feels like none of us can go to sleep lest we miss the next big thing. Now, before the Internet can even catch a breath from gushing over the magical applications of AI image generation, the hallowed “big thing” is already here.

Are you ready for it?

New frontiers in machine learning

So, what recent development in machine learning has the world floored this time? Here’s the big reveal: we can now generate videos using AI.

Yes, you read that right. In late September, Meta debuted Make-A-Video, a tool built upon new AI image generation technology to enable text-to-video generation. Then, just days later, Google unveiled not one but two outstanding video generator models, Imagen Video and Phenaki, with capabilities far exceeding the generative powers of the popular AI image generators flooding your Twitter threads recently. The announcements even took AI experts by surprise, with some calling it a “sooner-than-expected Dall-E moment for text-to-video generation”.

But before we dive deep into the specifics, humor me by playing this mini game. Look closely: Which of the two videos below do you think was generated by an AI?

Coffee pouring into a cup

Slow motion clip of a water-filled balloon bursting

If you picked just one, sorry, but you’ve been fooled! Both were generated by Google’s Imagen Video.

This explosion of new technology didn’t come without warning. If you’ve been following the news closely, talk of text-to-video generation (T2V generation for short) has been brewing since April this year. Shortly after, a tidal wave of excitement swept over mainstream media when AI text-to-image generators were finally made available for public use. ICYMI, I wrote an explainer on AI image generators in a separate article, so check it out for a detailed brief!

Now, allow me to walk you through the amazing research behind this text-to-video technology, so we can discover both its magnificent beauty and its undeniable flaws.

How does text-to-video generation work?

The answer to this question is pretty simple: AI video generation is an extension of the technology that powered AI image generators.

To understand how it actually works, we need to first talk about the relationship between images and videos. Broken down to its simplest form, an image takes up one frame, and when strung together with many other images, makes a video. This makes it easy to understand how image generation technology could serve as a strong foundation for video generation.
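If it helps to see that idea in code, here is a toy sketch of a video as nothing more than a list of frames. It is only an illustration of the frames-make-a-video relationship, not how Make-A-Video or Imagen Video store or generate anything, and the names `make_frame` and `make_video` are made up for this example.

```python
# A toy "video": a list of frames, where each frame is a 2D grid of pixels.
# Hypothetical helper names, for illustration only.

def make_frame(width, height, dot_x):
    """Return a height x width grid of 0s with one bright pixel (1) in the middle row."""
    frame = [[0] * width for _ in range(height)]
    frame[height // 2][dot_x] = 1
    return frame


def make_video(width, height, num_frames):
    """String frames together; the bright pixel moves one step right per frame."""
    return [make_frame(width, height, t % width) for t in range(num_frames)]


video = make_video(width=8, height=3, num_frames=8)

# Print the middle row of the first three frames to see the "motion".
for frame in video[:3]:
    print("".join("#" if pixel else "." for pixel in frame[1]))
```

Each frame on its own is just a still image; the sense of movement only appears when the frames are played back in order.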

While Meta and Google use different AI models (video diffusion models, as researchers call them), the underlying idea is the same: the neural network first converts textual information into visual information, then stitches the resulting frames together using the real-world knowledge it has gleaned from studying how things move and behave.

During the training phase, neural networks are force-fed with tons of text-video data. Think of them as a 5-year-old child learning about how the world works through watching hours upon hours of subtitled movies. In the early stages, the child might only be able to describe what a balloon floating into the sky looks like, but as they watch more, they also develop an understanding of the world beyond what the subtitles tell them, which allows them to imagine how certain objects might behave despite never having watched them before. This is the “intuition” that humans build up over years of experience, shaping and reshaping the way we see and understand the world around us.

Borrowing the above analogy, if you were to ask the AI video generator to render a video of a panda driving, the neural network does not need to have “watched” something like that before. It can generate a video so long as it has learned the basic concepts of “car” and “panda”, all while understanding that someone has to be inside the vehicle to drive it.

While grasping such logical concepts might seem intuitive to us humans, it’s far from easy for machines. In particular, being able to conceptualise an object in 3D form is a challenging problem that AI researchers have been attempting to solve for the longest time. This is the reason why occlusion — the idea that objects moving behind other objects do not simply disappear but are blocked from sight — is a tricky problem that machines consistently struggle with.

Fortunately, many of these problems have now been ironed out thanks to breakthrough advancements in machine learning, bringing greater fidelity to the images and videos they create.

Let’s take a quick look at what current text-to-video generation technology is really capable of!

What kind of videos can you generate?

I’m sure you’re just as curious as I am about this. Don’t worry, I’ve done the research for us — you’re welcome.

Resolution

According to the official research papers, Google’s Imagen Video takes the trophy for highest resolution, generating high-definition videos through its cascaded diffusion models at 1280×768 pixels. In comparison, Meta’s Make-A-Video comes in at 768×768 pixels, a step down in video quality but still good enough to be considered high definition.

Duration

In terms of duration, however, Google’s Phenaki is the clear winner of the lot. Sample videos on the Phenaki website last 2.5 minutes, and the researchers say it can generate “arbitrary long videos conditioned on a sequence of prompts”. Compared to Phenaki’s “visual story telling” generations, Imagen Video trades duration for resolution, managing 5.3-second videos at 24 frames per second.
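To put those numbers side by side, here is a quick back-of-the-envelope calculation using only the figures quoted above. The frame count is an estimate derived from the rounded 5.3-second duration, so treat it as approximate rather than as a number taken from the papers.

```python
# Back-of-the-envelope comparison using the figures quoted in this article.

def pixels_per_frame(width, height):
    """Number of pixels in a single frame."""
    return width * height


imagen_pixels = pixels_per_frame(1280, 768)        # Imagen Video
make_a_video_pixels = pixels_per_frame(768, 768)   # Make-A-Video

# How many times more pixels per frame Imagen Video produces.
pixel_ratio = imagen_pixels / make_a_video_pixels

# Approximate frame count for a 5.3-second clip at 24 frames per second.
approx_frames = round(5.3 * 24)

print(f"Imagen Video:  {imagen_pixels:,} pixels per frame")
print(f"Make-A-Video:  {make_a_video_pixels:,} pixels per frame")
print(f"Ratio:         {pixel_ratio:.2f}x")
print(f"Frames in a 5.3 s clip at 24 fps: about {approx_frames}")
```

In other words, each Imagen Video frame carries roughly two-thirds more pixels than a Make-A-Video frame, and a single short clip is already well over a hundred frames.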

Type of input

Due to the varying strengths and characteristics of different diffusion models, the types of input you can provide to an AI video generator differ from model to model. Without going into the specifics, here is a list of inputs you could give a video generator to render a video:

Text prompt – ranging from a few words to a sequence of sentences
A single image – image will be animated into a video
A pair of images – the video generator will fill in the gaps between the images
A video – the video generator will render a new video as a variant to the input video

This technology is advancing exponentially, so there’s a possibility that the information in this article may become obsolete in a few weeks’ time. Be sure to keep an eye out for new research in this field!

Is AI video generation technology ready yet?

The short answer is no. On top of its newness, there are a few other reasons that can clue us in on why this might be the case, and one of them concerns its quality and consistency.

First, let’s take a look at some examples from Google’s Imagen Video.

Text prompt: a shark swimming in clear Caribbean ocean

Text prompt: melting ice cream dripping down the cone

Looking at the sample videos, we see that it performs relatively well for commonly seen actions and objects, such as a shark swimming in the ocean or melting ice cream. The motions are natural and realistic, which is a spectacular result for a technology in its infancy.

However, when we start to generate objects and animals outside of their native environments, the quality drops significantly.

Text prompt: a happy elephant wearing a birthday hat walking under the sea

Text prompt: a teddy bear running in New York City

We can see the distortion clearly in the short clip of the elephant — its front legs morph into one another constantly, perhaps a result of the AI applying the concept of how things move underwater. The videos also appear “glitchy” and “choppy”, an indication that the AI still has much to learn about how things behave in different environments.

Several sample videos were outright wacky:

Text prompt: a goldendoodle playing by the lake

At present, one of the most awe-inspiring developments is Imagen Video’s ability to render text clearly, something that commercially deployed AI image generators struggle to do even now. Additionally, it demonstrates a good understanding of depth and three-dimensionality, producing drone fly-through shots that rotate around and capture objects from different angles without obvious distortion.

Remember when I talked about the problem with 3D concepts? Yup, they solved it.

Text prompt: a bunch of autumn leaves falling down on a calm lake to form the text "Imagen Video"

Text prompt: drone fly-through interior of Sagrada Família Cathedral

While impressive for a world premiere, the quality of the videos generated is inconsistent. The researchers themselves know this: so far, neither Meta nor Google has announced any plans for public release, citing data biases and ethical concerns. Unlike AI image generation, it will be some time before text-to-video generation matures into a commercially viable product open for public use. Until then, AI-generated videos remain a distant dream.

Luckily, the good news is that video generation technology is improving rapidly as we speak, and we can already see huge potential through its application across various industries.

Versatile applications, endless potential

As a creative writer myself, I’m excited about the potential applications AI video generators can have on the world, particularly in the area of content creation.

Below are my humble predictions (and secret wishlist) of how text-to-video technology can make a positive impact:

1. Storyboarding

Ads, cinema, gaming — what do these three have in common?

They tell stories in a visual medium.

Storyboarding, the process of producing a series of pictures that show the outline of the story, is an essential part of conceptualisation. For any scriptwriter, the ability to convey their creative vision to the team and translate it on-screen is just as important as, if not more important than, the content that they write.

Oftentimes, the storyboarding process can be both time-consuming and mentally exhausting, particularly when it’s unclear how a scene should be best conveyed. With AI video generators or animated video makers, imagined settings and characters can be generated smoothly in sequence, giving everyone a preliminary look at the envisioned story.

[Phenaki] Text prompts: A photorealistic teddy bear is swimming in the ocean at San Francisco. The teddy bear goes under water. The teddy bear keeps swimming under the water with colorful fishes. A panda bear is swimming under water.

Imagine having the ability to storyboard as you write! That would be a dream come true for all creatives.

2. Content creation

Videos are all the hype.

The rising popularity of short-video platforms like TikTok in recent years has cemented the status of short-form videos as the preferred content format for the bulk of Internet users. To keep up with this market trend, social media giants such as Instagram and YouTube have also begun their foray into the short-video market, tweaking their algorithms to push short-form video content to users.

Unfortunately, video production is a costly process for content creators, both in terms of money and resources. This is especially true for small-time creators who lack the time and expertise to shoot, edit, and stitch short-form videos all on their own. The result is market domination for industry Goliaths and the inadvertent smothering of independent creators.

AI video generation has the potential to correct this imbalance. Text-to-video technology condenses the video production process to just a single step: writing. It eliminates the need for specialised skills in video editing or audio stitching, allowing creators to direct their creative energy towards ideation.

Say goodbye to unnecessary stress over video animation!

3. Immersive learning

Picture a classroom environment where learning is augmented by VR (virtual reality) and video simulations. Doesn’t that sound fun?

For learning to be effective, it has to first be immersive. AI video generation can do this easily with the help of VR technology, transporting students back in time to stormy Caribbean seas alongside swashbuckling pirates.

Text prompt: flying through an intense battle between pirate ships in a stormy ocean

Subjects such as medicine and geography can also benefit greatly from the use of AI-generated videos, allowing for a more immersive learning experience that simulates reality.

Of course, the above are just some of the potential applications of this exciting new technology. Depending on the direction that AI video generation takes, the world could be so much better — but also a lot worse.

Dangers of AI video generation

As with any new tool, there are unintended dangers and risks associated with its use.

Deepfakes, digitally altered media that impersonate a person, are a type of content that could be maliciously exploited by bad actors. A 2021 study on misinformation found that video fakes are potentially more sinister as people have a tendency to “believe what they see with their own eyes”.

While popular applications of deepfake technology have been mostly harmless (see the video of Simon Cowell performing at “America’s Got Talent”), others have had deadly real-world consequences: a spate of mob riots and lynchings in India was triggered by fake videos spread via social media. Reports of revenge porn created using deepfake technology have also heightened global fears of AI-generated media supercharging fraud and identity theft.

The powerful psychological appeal behind video content explains why critics of AI technology are citing concerns over further abuse leading to declining trust in media. If we allow text-to-video technology to be used by anyone, we might just be ushering in a new era of online misinformation — now made even more believable and convenient with AI.

Image generated by Hypotenuse AI using the text prompt: dawn breaking, photorealistic, highly detailed, high resolution, octane render

Dangers can also come from within, specifically from data biases that are encoded into the neural network.

The problem of data biases in machine learning models is well documented. In 2018, Amazon scrapped their AI recruiting tool after it unfairly penalized applications sent in by women. In the same year, MIT researchers found that facial analysis technologies did not perform as well when made to scan faces of minorities, in particular minority women.

These cases are not simply one-off errors in coding. Researchers and engineers alike have warned of this insidious problem for years, casting serious doubt over the reliability of AI machines to generate output free of human biases. At the moment, there is no complete solution to this problem.

At the end of the day, an AI model is only as good, or as bad, as the data it is trained on. The onus is on us to ensure our human biases do not creep into the algorithms that we increasingly depend on to assist us in our daily work.

Will we overcome this eventually? Only time will tell.

What does this development mean for us?

Hmm… Not much.

At this stage, it’s too early for us to know where text-to-video technology is headed. While present showcases have certainly shown promising results, perennial concerns over in-built data biases have proved challenging to mitigate. As I write this, Google researchers are working overtime to resolve existing issues while governments around the world enact laws to regulate AI.

Realistically, AI video generators will most probably take a few more years before they have a tangible impact on our way of life. Even AI image generation, a relatively novel tool in the space, is still in the midst of finding its place in the content creation market and beyond.

For now, anything can happen. Are you ready for the next big thing?

Friendly PSA: Write engaging articles like this (or a professional email to your boss demanding a pay raise) with our AI writing tool.

Alex

Content Writer

Alex is a seasoned writer responsible for creating valuable, well-researched content for various industries like tech and ecommerce.

The New Wave: AI Video Generators
