After Sora: The Evolution of Chinese Chess in Video Generation Models

12/12 2024

Since OpenAI first previewed it on February 16th, Sora has been closely watched as a glimpse of future technology. On December 10th, the official version of Sora was finally unveiled, capable of generating videos with a maximum resolution of 1080p and a duration of up to 20 seconds.

OpenAI CEO Altman described the official release of Sora as the GPT-1 moment in the realm of video generation.

Unlike in the GPT era, however, domestic (Chinese) AI enterprises have not simply mirrored OpenAI's strides in video generation. Instead, their response has been more nuanced and diverse.

Some have opted to emulate. For instance, following the introduction of Sora, internet giants such as Alibaba, ByteDance, Kuaishou, and Tencent, along with AI companies like Zhipu AI, MiniMax, Aishi Technology, Shengshu Technology, and others, have successively unveiled video generation models. Many claim to have matched or surpassed the capabilities of Sora's preview version.

Others have chosen a different path. This includes internet companies like Baidu, where Robin Li (Li Yanhong) has unequivocally stated, "No matter how popular Sora becomes, Baidu will not engage in it." AI firms like Baichuan Intelligence have also made it clear they won't develop a Sora-like model. Companies like Moonshot AI (Dark Side of the Moon), SenseTime, and 01.AI (Zero-One Wanwu) do possess text-to-video models, but they're not prioritizing them.

The video generation race no longer follows the GPT-era pattern in which OpenAI led and domestic enterprises followed suit. Post-Sora, the domestic AI landscape has started to carve its own path, presenting a more intricate picture.

Domestic tech firms capable of developing general-purpose foundational large models are beginning to show significant differences in their technological roadmap and commercial prospects. Let's dissect the Chinese chess game in video generation based on whether domestic enterprises emulate Sora.

First, let's clarify what domestic tech firms that benchmark against Sora are actually doing.

In essence, Sora's core technical approach combines a diffusion model with a Transformer architecture, accepting text (natural language), images, and videos as prompts for video creation.

Models benchmarked against Sora must possess at least two key characteristics:

1. Versatility: Capable of generating videos of any content, not limited to a specific style, industry, or character.

2. High Quality: High image resolution (up to 1080p), extended video duration (up to one minute), and strong scene consistency (understanding physical laws).

Unlike when ChatGPT launched, domestic tech enterprises were not caught unprepared by Sora. However, the decision to emulate is no longer as unanimous as it was with ChatGPT, and attitudes fall into three categories:

The first category includes those who have clearly chosen to emulate.

Among internet companies, ByteDance and Kuaishou, whose core business is video, and the comprehensive tech firm Tencent all have mature digital infrastructure, abundant technical talent, and built-in video product expertise, so they chose to follow suit almost immediately. ByteDance launched Jimeng (Dreamina), Kuaishou released the Kling (Keling) large model, and Tencent, building on its Hunyuan large model, unveiled and open-sourced the Hunyuan multimodal generation model, regarded as Tencent's version of Sora.

Among large model startups, Zhipu AI has been the most agile, releasing the AI video generation tool Qingying in July this year, enabling users to generate 10-second, 4K, 60fps videos from text/images. MiniMax's Hailuo AI also added video generation capabilities in October, supporting the creation of 6-second video clips from text prompts.

The second category comprises those who have resolutely chosen not to emulate.

In contrast to the first category, there are also internet companies and AI startups that have unequivocally opted against following Sora. For instance, after Sora's introduction, Wang Xiaochuan of Baichuan Intelligence said that although some team members had proposed developing a Sora-like model, he made it clear the company would not pursue this direction.

Li Yanhong of Baidu shares a similar sentiment. Despite Baidu's achievements in video generation, his stance against developing a Sora-like product is firm. His reasoning is that the commercialization of Sora may take five or even ten years. For now, Baidu is more focused on large language models and multimodal large models, with no plans to productize a Sora-like model.

The third category includes those who have only dabbled.

Additionally, numerous domestic enterprises have made some moves in response to Sora out of FOMO (fear of missing out) but are not prioritizing it, remaining in a state of superficial involvement.

For example, Alibaba's Alimama team released tomoVideo to test the waters of video generation for e-commerce marketing. Among the "Six Little Tigers" of AI startups, Dark Side of the Moon has also launched a video generation model but remains focused on its Kimi product. Zero-One Wanwu has ventured into B-end business, but the film and television production industry, the main target of video generation models, is currently in an adjustment period, making it hard for Sora-like products to become a core growth area.

In summary, if the global large model competition is akin to a game of Dou Dizhu (a popular Chinese card game), then the rules are no longer that OpenAI plays a trump card and domestic tech enterprises follow suit. Instead, each enterprise formulates its Sora strategy based on its own resources, business priorities, and objectives.

Why have the rules changed in the large model industry with the advent of Sora?

The behavior of domestic tech enterprises indicates a lack of consensus on Sora, with an overall chaotic situation and unclear rules. In a field cloaked in uncertainty, each player can only work out the rules of the game for itself.

The current state of the video generation field is shrouded in three uncertainties.

Technical Uncertainty: OpenAI believes Sora is a promising path to a world simulator and AGI, but this technology roadmap is currently contentious.

For instance, Fei-Fei Li and Yann LeCun believe Sora cannot achieve AGI. Li argues that Sora still operates on two-dimensional images, and that only three-dimensional spatial intelligence can attain AGI. The preview version of Sora showed a generated video of a "Japanese woman walking through the neon-lit streets of Tokyo," but it was impossible to place the camera behind the woman, indicating Sora does not genuinely comprehend the three-dimensional world. Academic giant Yann LeCun also expressed disapproval, stating that Sora is not a true world model and will still encounter significant bottlenecks akin to those of GPT-4.

Indeed, even in the official version of Sora, issues like inaccurate hand details and inconsistency during dynamic processes persist.

One reason some domestic companies are resolute in not emulating Sora is their reservations about this technology roadmap. For example, Wang Xiaochuan of Baichuan Intelligence believes Sora is merely a phased product, with inferior technical sophistication, breakthroughs, and application value compared to GPT. In short, because the route to achieving AGI and simulating the physical world remains an open question, Sora is not the sole solution.

Commercial Uncertainty: The commercial prospects and ROI of video generation models are unclear in the short term, which is another hurdle that discourages domestic enterprises.

Both the preview and official versions of Sora continue OpenAI's "brute force aesthetics." OpenAI research scientist Noam Brown stated that Sora is the most intuitive demonstration of scale's power, leveraging computational power, data, and parameter volume to attempt to endow large models with an understanding of the physical world. This method is costly and resource-intensive. Whether to emulate Sora depends on each company's commercial expectations and ROI for the model.

If video generation models target business customers (ToB) and charge through APIs or SaaS services, foundational model vendors must invest significant manpower in optimizing business processes and developing interactive interfaces. However, the film and television industry is currently in an adjustment period, limiting the growth of AI film and television production businesses. This raises the opportunity cost for AI enterprises, because the same human, material, and computational resources would yield greater returns in areas like financial AI, educational AI, and large-scale government and enterprise projects. Therefore, companies like Baidu and Zero-One Wanwu have positioned video generation as a peripheral business and are not prioritizing it.

In the consumer (ToC) scenario, there are two obstacles. On one hand, individuals have a low willingness to pay, since video generation is not a high-frequency everyday use case, and generation costs and subscription fees are generally higher than for text models. In addition, because Sora-like models haven't resolved issues of hallucination and consistency, they may not create practical value, which limits the scale of consumer payments. On the other hand, a completely free model, in which the video generation product serves as a traffic portal for the enterprise, is a business model suitable only for companies with video as their core business.

For example, Kuaishou and ByteDance, which already have core video businesses, can quickly scale their models. Whether targeting C-end users or B-end productivity tools, such enterprises can swiftly integrate and consolidate video generation capabilities with existing products, and the marginal cost of model development will decrease as commercialization scales up.

Overall, for the vast majority of domestic foundation model vendors, video generation is a relatively peripheral field with a low ROI.

Competitive Uncertainty: The third uncertainty lies in the competitive market landscape.

While the commercial prospects of video generation models are currently unclear, could they experience explosive growth in the future, with companies quietly investing and then surprising everyone? This business myth of betting on an overlooked sector to "hit the jackpot" is unlikely to play out with large models.

Currently, the productization and commercialization prospects of large models are generally vague. Vendors of general models need to quickly pick, from a myriad of unproven products, the ones with a higher probability of success and greater market potential, and focus their investments there. Among all these products, the video generation model is a particularly burdensome and challenging project. In this context, it makes sense to prioritize products with a higher success rate and lower the business priority of video generation models.

From another perspective, even if a company prioritizes video generation models, it may still struggle to establish a competitive advantage. The current market competition for large models differs from the GPT era. By now, companies have built up training infrastructure, core architecture design experience, and technical reserves. The technical barriers to replicating Sora and launching similar applications are not as high as they were in the ChatGPT era. This also means that even if a company is the first to release a video generation model, it may not sustain a long-term competitive edge or market monopoly, diminishing the commercial allure of Sora.

Technical, commercial, and competitive uncertainty still shroud the field of video generation, leaving many possibilities open in the Sora chess game. It's too early to determine which understanding is correct or which path will ultimately prevail, and each company can only continue playing by its own rules.

Large model technology must evolve, but starting with Sora, domestic tech enterprises are no longer closely mirroring OpenAI but have embarked on their own path.

Specifically, for something as groundbreaking as Sora, domestic enterprises have developed their own understandings and considerations regarding the productization and commercialization of large models, and are beginning to develop their own styles of play. Emulating Sora showcases strength, while not emulating demonstrates mindset and strategic resolve.

Moreover, even if domestic enterprises do not blindly follow OpenAI's products, its narrative prowess is still worth emulating.

Whether it was stealing the spotlight from Google with the launch of Sora in February or the recent official unveiling of the platform, OpenAI has consistently set the pace, generated buzz, and garnered attention, a pivotal skill for capital-intensive AI enterprises.

While it's understandable that not every company will follow in the footsteps of Sora, it's essential that they not overlook crucial technologies.

Take Baidu as an example. Although it doesn't have immediate plans to introduce a product akin to Sora, it hasn't been absent from the realm of pivotal technologies. For instance, Baidu has independently developed a multimodal controllable image generation technology that achieves highly versatile image creation while maintaining consistent entity features. The enhancement of controllability is precisely the linchpin for the next phase of video generation. Furthermore, Baidu hasn't completely shunned the video generation domain and has invested in startups like Shengshu Technology and AI short video drama company Jingying Technology.

Focusing on their core competencies and prioritizing catch-up strategies based on diverse factors like business priorities and commercial considerations, domestic enterprises are discovering their unique pace in the large model game.
