Sora Arrives: AI Text-to-Video Dazzles Us
Time:2024-02-21
1、 Introduction to Sora's Concept
On February 16, 2024, OpenAI unveiled Sora, a large text-to-video model that generates videos from natural-language descriptions. As soon as the news broke, OpenAI once again stunned social media platforms around the world, and Sora abruptly raised the bar for AI video. While text-to-video tools such as Runway and Pika are still working toward coherence over a few seconds, Sora can directly generate a 60-second video in a single continuous shot. Notably, it already achieves this before its official public release.
The name Sora comes from the Japanese word for "sky" (そら sora) and is meant to suggest unlimited creative potential.
Compared with the AI video models mentioned above, Sora's advantage is that it can render details accurately, understand how objects exist in the physical world, and generate characters with rich emotions. It can generate videos from text prompts or still images, and can even fill in missing frames in existing videos.
2、 The implementation path of Sora
Sora's significance lies in once again raising the ceiling of AIGC (AI-generated content) in AI-driven content creation. Before Sora, text models such as ChatGPT had already begun to assist content creation, including generating illustrations and visuals and even using virtual humans to produce short videos. Sora, by contrast, is a large model focused on video generation. Given text or images as input, it can work with video in several ways, including generating, connecting, and extending clips. It belongs to the category of multimodal large models, which extend language models such as GPT.
Sora processes video patches in much the same way that GPT-4 processes text tokens. The key innovation is treating a video as a sequence of patches, analogous to word tokens in a language model, which lets the model handle many kinds of video information effectively. By conditioning on text, Sora can generate contextually relevant and visually coherent videos from text prompts.
In principle, Sora's video training involves three main steps. First, a video compression network reduces videos or images to a compact, lower-dimensional representation. Next, spatiotemporal patch extraction decomposes the visual information into smaller units, each containing part of the spatial and temporal information, so that Sora can process them in a targeted way in later steps. Finally, video generation: the input text or images are encoded and decoded, and a Transformer model (the same architecture that underlies ChatGPT) decides how to transform or combine these units into the complete video.
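The spatiotemporal patch extraction step can be illustrated with a minimal sketch. This is not OpenAI's implementation: the function name `extract_spacetime_patches`, the patch sizes, and the use of a plain NumPy array in place of the compressed video representation are all assumptions made for illustration.

```python
import numpy as np

def extract_spacetime_patches(video, pt, ph, pw):
    """Split a video array of shape (T, H, W, C) into flattened spacetime patches.

    Hypothetical helper: each patch spans pt frames, ph rows and pw columns.
    Returns an array of shape (num_patches, pt * ph * pw * C).
    """
    T, H, W, C = video.shape
    if T % pt or H % ph or W % pw:
        raise ValueError("video dimensions must be divisible by the patch size")
    # Split each axis into (number of blocks, block size).
    blocks = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    # Move the block indices to the front and the within-block indices to the back.
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5, 6)
    # One row per patch: the token-like sequence a Transformer would consume.
    return blocks.reshape(-1, pt * ph * pw * C)

# A dummy 8-frame, 32x32, 3-channel array standing in for a (compressed) video.
video = np.arange(8 * 32 * 32 * 3, dtype=np.float32).reshape(8, 32, 32, 3)
patches = extract_spacetime_patches(video, pt=2, ph=8, pw=8)
print(patches.shape)  # (64, 384)
```

Each row of `patches` plays the role that a word token plays in a language model: a small unit carrying a slice of both space and time.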
Overall, the emergence of Sora will further promote the development of AI video generation and multimodal large models, bringing new possibilities to the field of content creation.
3、 Sora's 6 Advantages
A reporter from the Daily Economic News reviewed the report and summarized six advantages of Sora:
(1) Accuracy and diversity: Sora can convert short text descriptions into high-definition videos up to 1 minute long. It accurately interprets the text input provided by users and generates high-quality video clips with a variety of scenes and characters. It covers a wide range of subjects, from people and animals to lush landscapes, urban scenes, gardens, and even an underwater New York City, producing diverse content to match user requirements. According to Medium, Sora can accurately interpret long prompts of up to 135 words.
(2) Powerful language understanding: OpenAI uses the re-captioning technique from DALL·E 3 to generate descriptive captions for the visual training data, which improves both text fidelity and the overall quality of the videos. In addition, as with DALL·E 3, OpenAI uses GPT to turn short user prompts into longer, more detailed captions that are then sent to the video model. This enables Sora to generate high-quality videos that accurately follow user prompts.
(3) Generate videos from images/videos: Sora can not only convert text into videos, but also accept other types of input prompts, such as existing images or videos. This enables Sora to perform a wide range of image and video editing tasks, such as creating seamlessly looping videos, animating static images, and extending videos forward or backward in time. In the report, OpenAI presented demo videos generated from DALL·E 2 and DALL·E 3 images. This demonstrates both Sora's capabilities and its potential in image and video editing.
(4) Video extension function: Because Sora accepts diverse input prompts, users can create videos from images or add to existing videos. As a Transformer-based diffusion model, Sora can also extend videos forward or backward along the timeline.
(5) Excellent device compatibility: Sora has flexible sampling capabilities, ranging from 1920x1080 widescreen to 1080x1920 portrait, and can handle any video size between the two. This means Sora can generate content for various devices directly at their native aspect ratios. Before generating high-resolution content, Sora can also quickly prototype content at a small size.
(6) Consistency and continuity between scenes and objects: Sora can generate videos with dynamic changes of perspective, in which characters and scene elements move more naturally through three-dimensional space. Sora also handles occlusion well. One problem with existing models is that they may lose track of objects that leave the field of view. By predicting multiple frames at once, Sora ensures that the subject remains unchanged even when it is temporarily out of view.
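The small-size prototyping mentioned in advantage (5) can be sketched as a simple size calculation. This is only an illustration under stated assumptions: `prototype_size`, the `max_side` limit, and the rounding to a multiple of 8 are hypothetical choices, not parameters documented for Sora.

```python
def prototype_size(width, height, max_side=256, multiple=8):
    """Scale (width, height) down so the longer side is at most max_side.

    Hypothetical helper: keeps the aspect ratio and rounds each side to the
    nearest multiple of `multiple` (an assumed constraint, not one from Sora).
    """
    scale = min(1.0, max_side / max(width, height))

    def snap(side):
        # Round the scaled side to the nearest multiple, never below one multiple.
        return max(multiple, int(round(side * scale / multiple)) * multiple)

    return snap(width), snap(height)

print(prototype_size(1920, 1080))  # (256, 144) widescreen prototype
print(prototype_size(1080, 1920))  # (144, 256) portrait prototype
```

Both results keep the 16:9 and 9:16 proportions of the full-size targets, so a prototype previews the same framing as the final video.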
4、 Disadvantages of Sora
Although Sora is very powerful, it still has problems simulating physical phenomena in complex scenes, understanding specific cause-and-effect relationships, handling spatial details, and accurately depicting events that unfold over time.
In one video generated by Sora, the overall picture is highly coherent, with excellent image quality, detail, lighting, and color. On closer inspection, however, the characters' legs are slightly twisted, and their steps do not match the rest of the scene.
In another video, the number of dogs keeps increasing. The transitions are very smooth, but the result may have drifted from the original request.
(1) Inaccurate simulation of physical interaction:
The Sora model is not precise enough in simulating basic physical interactions, such as glass breakage. This may be because the model lacks sufficient examples of such physical events in the training data, or the model is unable to fully learn and understand the underlying principles of these complex physical processes.
(2) Incorrect change in object state:
When simulating interactions involving significant changes in object state, such as eating food, Sora may not always accurately reflect the changes. This indicates that the model may have limitations in understanding and predicting the dynamic process of object state changes.
(3) Incoherence in long video samples:
When generating long video samples, Sora may produce incoherent plots or details, possibly because the model has difficulty maintaining contextual consistency over long time spans.
(4) The sudden appearance of an object:
Objects may appear in videos for no reason, indicating that the model still needs to improve its understanding of spatial and temporal continuity.
Here we need to introduce the concept of a "world model".
What is a world model? Let me give an example.
From memory, you know how much a cup of coffee weighs. So when you go to pick one up, your brain accurately predicts how much force to use, and you lift the cup smoothly without even noticing. But what if the cup happens to be empty? You apply a lot of force to a very light cup, and your hand immediately feels that something is wrong. You then add a note to your memory: the cup may also be empty. The next time you make the prediction, you will not get it wrong. The more things you do, the more complex the world models that form in your brain, and the more accurately you can predict how the world will react. This is how humans interact with the world: through a world model.
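The coffee cup example amounts to a predict-then-correct loop, which can be sketched in a few lines. This is a toy illustration only: the class `CupWeightModel`, the weights in grams, and the `learning_rate` are invented for this example and say nothing about how Sora is trained.

```python
class CupWeightModel:
    """Toy predictor that keeps a running estimate of how heavy a cup is."""

    def __init__(self, initial_guess_grams, learning_rate=0.5):
        self.estimate = initial_guess_grams
        self.learning_rate = learning_rate

    def predict(self):
        # "Remember first": the prediction comes from what has been seen so far.
        return self.estimate

    def update(self, observed_grams):
        # The surprise is the gap between what was felt and what was expected.
        error = observed_grams - self.estimate
        # Move the estimate part of the way toward the observation.
        self.estimate += self.learning_rate * error
        return error

model = CupWeightModel(initial_guess_grams=400.0)  # expects a full cup
surprise = model.update(observed_grams=100.0)      # the cup turned out to be empty
print(surprise, model.predict())                   # -300.0 250.0
```

After one surprise the model expects a lighter cup, and each further observation refines the estimate, mirroring the note added to memory in the example.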
In videos generated by Sora, a bite may not always leave a mark, and other things can go wrong at times. But this is already remarkably powerful, even frightening, because "remember first, predict later" is how humans understand the world. This way of thinking is called the world model.
Sora's technical report contains this sentence:
"Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world."
In other words, what OpenAI ultimately wants to build is not a text-to-video tool but a general-purpose "physical world simulator": a world model that models the real world.