Data Digest 2
Posted on March 3, 2024 • 4 minutes • 793 words
Overview
You have probably heard of Sora in the news or somewhere else. Sora is OpenAI’s new text-to-video generation model. It is powerful: it can generate videos up to a minute long from a text prompt, with high visual quality and strong prompt adherence, meaning the output follows the original prompt closely. It incorporates motion smoothly and interprets prompts accurately. It can also transform one video into another, for example by changing the scenery and background. [Click to check out generated videos].
You might have seen AI-generated “videos” online before (like the cursed Will Smith eating spaghetti clip), but those videos lack cohesion between frames, since people use image tools to stitch multiple images together. Sora might seem perfect by comparison, but the model does have problems. It struggles with physics (such as simulating glass breaking), direction (left versus right), cause and effect, and camera trajectory. [Click to see Issue Examples]. For example, someone might eat a cookie in the video, but the cookie shows no bite mark afterwards. Or the camera moves in a way that would be odd or unrealistic in a real-life setting. But how does it even work?
Technical Aspects
In short, OpenAI trained a text-conditional diffusion model on images and videos of diverse durations, aspect ratios, and resolutions. Let’s break down how they did it.
Step 1: they needed videos, and captions for those videos. They scraped a wide variety of videos from the internet so that the model would have a diverse set of properties to learn from. Next, they needed captions. For that, OpenAI trained an advanced captioner model that generates descriptive and meaningful captions for videos, and the resulting video-caption pairs were used to train the main model.
Before moving on to the next step, let’s talk about LLMs, transformers, and latent space. LLMs are Large Language Models, which can generate human-like text. For an LLM to process text, the text first needs to be tokenized, that is, split into pieces that are converted into numbers. LLMs are basically text transformers. Transformers in general handle sequential data, and a video, being a sequence of frames, is sequential too. Latent space is a concept where we compress data into a smaller, lower-dimensional representation, allowing us to represent huge amounts of data in less space. With all that in mind, let’s go back to Sora.
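To make the idea of tokenization concrete, here is a minimal sketch in Python. It assumes a toy word-level vocabulary built from two example sentences. Real LLMs use learned subword tokenizers instead, and the names `build_vocab` and `tokenize` are made up for this illustration.

```python
def build_vocab(texts):
    """Assign an integer id to every distinct lowercase word, in first-seen order."""
    vocab = {}
    for text in texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab


def tokenize(text, vocab, unknown_id=-1):
    """Convert a string into a list of integer token ids (unknown words get unknown_id)."""
    return [vocab.get(word, unknown_id) for word in text.lower().split()]


# Toy corpus (an assumption for illustration, not real training data).
corpus = ["a cat walks on the beach", "a dog runs on the beach"]
vocab = build_vocab(corpus)

token_ids = tokenize("a cat runs on the beach", vocab)
print(token_ids)  # [0, 1, 7, 3, 4, 5]
```

The model never sees the words themselves, only the list of numbers.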
We can think of the Sora model as three models, the first being an encoder. First, a visual encoder compresses the video into a latent representation that is smaller than the video itself. Then, in a step analogous to tokenization, the compressed video is split into smaller spacetime patches. These patches play the same role as text tokens do in LLMs.
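Here is a rough sketch of what cutting a video into spacetime patches could look like. It assumes the compressed video is a plain grid of numbers indexed as `video[t][y][x]`, and that patches are non-overlapping blocks. The patch sizes and the helper names `make_video` and `to_spacetime_patches` are made up for this illustration, since OpenAI has not published the exact details.

```python
def make_video(frames, height, width):
    """Build a toy 'latent video' as nested lists; video[t][y][x] is a unique number."""
    return [[[t * height * width + y * width + x for x in range(width)]
             for y in range(height)]
            for t in range(frames)]


def to_spacetime_patches(video, pt, ph, pw):
    """Cut a video into non-overlapping blocks of pt frames, ph rows and pw columns.

    Each block is flattened into one list, so the result is a sequence of
    'patch tokens'. Every dimension must divide evenly by its patch size.
    """
    frames, height, width = len(video), len(video[0]), len(video[0][0])
    if frames % pt or height % ph or width % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    patches = []
    for t0 in range(0, frames, pt):
        for y0 in range(0, height, ph):
            for x0 in range(0, width, pw):
                patch = [video[t][y][x]
                         for t in range(t0, t0 + pt)
                         for y in range(y0, y0 + ph)
                         for x in range(x0, x0 + pw)]
                patches.append(patch)
    return patches


video = make_video(frames=4, height=4, width=6)
patches = to_spacetime_patches(video, pt=2, ph=2, pw=3)
print(len(patches), len(patches[0]))  # 8 12
```

Each patch covers a small region of the picture across a few consecutive frames, which is why it is called a spacetime patch. A longer or larger video simply produces a longer sequence of patches.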
Next comes what’s called a diffusion transformer, which combines a diffusion model with a transformer model. A diffusion model starts from noise: we take noisy patches (along with the tokenized text prompt, which is tokenized much like user text in ChatGPT) and use diffusion and transformer techniques to gradually denoise the video. As the noise is removed over many iterations, the output should come to match the clean compressed latent representation. [Click to see an example of the diffusion process in action].
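The loop below is a toy sketch of that gradual denoising, not Sora’s actual sampler. It assumes a tiny four-number “latent” and replaces the trained network with an oracle that already knows the clean answer, purely to show the shape of the iteration. The function names, the step count, and the rule of removing an equal share of noise per step are all assumptions made for this illustration.

```python
import random


def oracle_noise_estimate(noisy, clean):
    """Stand-in for the trained network: returns the exact noise left in `noisy`.

    A real diffusion transformer would have to *predict* this from the noisy
    patches and the text prompt, without ever seeing `clean`.
    """
    return [n - c for n, c in zip(noisy, clean)]


def denoise(start, clean, steps=10):
    """Remove an equal share of the remaining noise at every step."""
    current = list(start)
    history = [current]
    for step in range(steps):
        noise = oracle_noise_estimate(current, clean)
        remaining = steps - step  # how many steps are left, including this one
        current = [x - e / remaining for x, e in zip(current, noise)]
        history.append(current)
    return history


def distance(a, b):
    """Euclidean distance between two equally long lists of numbers."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable
clean_latent = [0.5, -1.0, 2.0, 0.0]  # hypothetical target latent
pure_noise = [rng.gauss(0.0, 1.0) for _ in clean_latent]

history = denoise(pure_noise, clean_latent, steps=10)
errors = [distance(h, clean_latent) for h in history]
print(round(errors[0], 3), "->", round(errors[-1], 3))
```

The distance to the clean latent shrinks at every step and ends at essentially zero. In a real model the prediction is imperfect, which is one reason the noise is removed in many small steps rather than in one jump.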
Finally, now that we have generated a valid latent representation, we need to convert it back into an actual video. For this we use a decoder, whose job is to take the generated latent and map it back to video. One major advantage of Sora’s architecture is that it supports many different framings and aspect ratios. To summarize, Sora was built using ideas from the encoder-decoder architecture, diffusion models, and transformers. It was trained on “tokenized” videos, learning to gradually remove noise until the result matches the clean latent representation. Moving on: what are the implications of letting people use Sora?
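Before turning to that question, here is a toy sketch of the encoder and decoder roles described above. Sora’s encoder and decoder are learned neural networks whose details are not public. This example swaps in simple block averaging for compression and value repetition for decompression, on a single hand-made frame, and the names `encode` and `decode` are made up for this illustration.

```python
def encode(frame, factor=2):
    """Toy 'encoder': average every factor x factor block of pixels into one value."""
    height, width = len(frame), len(frame[0])
    if height % factor or width % factor:
        raise ValueError("frame dimensions must be divisible by factor")
    area = factor * factor
    return [
        [sum(frame[y + dy][x + dx] for dy in range(factor) for dx in range(factor)) / area
         for x in range(0, width, factor)]
        for y in range(0, height, factor)
    ]


def decode(latent, factor=2):
    """Toy 'decoder': repeat every latent value to get back a full-size frame."""
    return [
        [latent[y // factor][x // factor] for x in range(len(latent[0]) * factor)]
        for y in range(len(latent) * factor)
    ]


# A hand-made 4x4 frame made of uniform 2x2 blocks (an assumption, so nothing is lost).
frame = [[1, 1, 4, 4],
         [1, 1, 4, 4],
         [2, 2, 8, 8],
         [2, 2, 8, 8]]

latent = encode(frame)      # 2x2 grid: four numbers instead of sixteen
restored = decode(latent)   # back to 4x4
print(latent)  # [[1.0, 4.0], [2.0, 8.0]]
```

The same two functions work unchanged on a wide 2x6 frame or a tall 6x2 frame, which hints at why this kind of design can handle different aspect ratios. On a real frame with fine detail, this toy compression would lose information.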
Ethical Concerns
There are major ethical concerns with this tool becoming open to the public. Firstly, it makes it easy to generate fake content that can harm people significantly. Secondly, it puts many jobs and industries at risk, such as filmmaking, animation, and video production. Finally, anybody can use it as a tool to generate propaganda or hateful content. OpenAI is currently considering all of these issues and is working with red teamers to assess risks like these.
Resources
Check out more videos generated by Sora at: https://openai.com/sora
Find out more about how Sora works at: https://openai.com/research/video-generation-models-as-world-simulators
To learn more about ethical concerns go to: https://medium.com/@stirikaai/rasora-ai-revolutionizing-content-creation-and-raising-ethical-concerns-e69c3d55336c
To understand diffusion transformers check out: https://www.linkedin.com/pulse/diffusion-transformer-its-applications-including-sora-frank-viqae/