Stability AI is pushing the boundaries of generative AI models, introducing a new dimension with the launch of Stable Video 4D.
While there are numerous gen AI tools for video generation, such as OpenAI’s Sora, Runway, Haiper, and Luma AI, Stable Video 4D offers something unique. It builds on Stability AI’s existing Stable Video Diffusion model, which transforms images into videos. The new model takes this idea to the next level by accepting video input and generating multiple novel-view videos from 8 different perspectives.
“We envision Stable Video 4D being utilized in movie production, gaming, AR/VR, and other scenarios where there is a demand to view dynamically moving 3D objects from various camera angles,” said Varun Jampani, Team Lead, 3D Research at Stability AI, in a conversation with VentureBeat.
Stable Video 4D: More than just 3D for gen AI
This isn’t Stability AI’s first venture beyond 2D.
In March, the company announced Stable Video 3D, which allows users to generate short 3D videos from an image or text prompt. Stable Video 4D takes that a significant step further. While 3D, or three dimensions, is commonly understood to mean an image or video with depth, 4D isn’t as widely understood.
Jampani clarified that the four dimensions are width (x), height (y), depth (z) and time (t). This means Stable Video 4D can view a moving 3D object from various camera angles and at different points in time.
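To make that concrete, the output of such a system can be pictured as a grid of ordinary 2D frames indexed by camera view and timestamp. The short Python sketch below is purely illustrative; the sizes used (8 views, 21 timestamps, 64x64 frames) are placeholder assumptions, not published specifications.

```python
import numpy as np

# Placeholder dimensions, for illustration only.
num_views, num_frames = 8, 21
height, width, channels = 64, 64, 3

# A "4D" result can be pictured as a grid of ordinary 2D frames,
# indexed by (camera view, time) on top of the spatial axes.
frame_grid = np.zeros((num_views, num_frames, height, width, channels), dtype=np.uint8)

# Viewing the moving object from view 3 at timestamp 10:
frame = frame_grid[3, 10]          # a single 64x64 RGB image
print(frame.shape)                 # (64, 64, 3)

# Freezing time and orbiting the camera instead:
orbit_at_t10 = frame_grid[:, 10]   # 8 views of the same instant
print(orbit_at_t10.shape)          # (8, 64, 64, 3)
```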
“The key factors that enabled Stable Video 4D are the combination of our previously-released Stable Video Diffusion and Stable Video 3D models, fine-tuned with a meticulously curated dynamic 3D object dataset,” Jampani explained.
Jampani highlighted that Stable Video 4D is a first-of-its-kind model in which a single network performs both novel view synthesis and video generation. Existing approaches use separate networks for the two tasks.
He further explained that Stable Video 4D differs from Stable Video Diffusion and Stable Video 3D in terms of how the attention mechanisms function.
“We have carefully designed attention mechanisms in the diffusion network which allow each video frame to attend to its neighbors at different camera views or timestamps, resulting in improved 3D coherence and temporal smoothness in the output videos,” Jampani said.
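Stability AI has not published this code, but the idea Jampani describes, frames attending to their neighbors across camera views and across timestamps, can be sketched with a toy PyTorch module. All names and sizes below are illustrative assumptions, not the actual Stable Video 4D architecture.

```python
import torch
import torch.nn as nn

class ViewTimeAttention(nn.Module):
    """Illustrative sketch only: lets each frame's tokens attend to the
    corresponding tokens in other camera views and in other timestamps."""

    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.view_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (views, frames, tokens, dim) -- latent tokens for every
        # (camera view, timestamp) pair in the output grid.
        v, t, n, d = x.shape

        # Attend across camera views: for each timestamp and spatial token,
        # the attention sequence is the set of views.
        xv = x.permute(1, 2, 0, 3).reshape(t * n, v, d)
        xv, _ = self.view_attn(xv, xv, xv)
        x = x + xv.reshape(t, n, v, d).permute(2, 0, 1, 3)

        # Attend across timestamps: for each view and spatial token,
        # the attention sequence is the set of frames in time.
        xt = x.permute(0, 2, 1, 3).reshape(v * n, t, d)
        xt, _ = self.time_attn(xt, xt, xt)
        x = x + xt.reshape(v, n, t, d).permute(0, 2, 1, 3)
        return x

# Smoke test with placeholder sizes: 8 views, 5 timestamps,
# 16 spatial tokens of width 64.
frames = torch.randn(8, 5, 16, 64)
out = ViewTimeAttention()(frames)
print(out.shape)  # torch.Size([8, 5, 16, 64])
```

The two attention passes are what give each output frame visibility into its neighbors in the view direction and in the time direction, which is the kind of coupling Jampani credits for the model's 3D coherence and temporal smoothness.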
How Stable Video 4D operates differently from gen AI infill
With gen AI tools for 2D image generation, the concepts of infill and outfill, filling in missing parts of an image, are well established. That is not how Stable Video 4D operates, however.
Jampani explained that the approach differs from generative infill/outfill, where networks complete partially given information. In those cases, the output is already partially filled by the explicit transfer of information from the input image.
“Stable Video 4D synthesizes the 8 novel view videos from scratch, using the original input video as guidance,” he said. “There is no explicit transfer of pixel information from input to output; all of this information transfer is done implicitly by the network.”
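The distinction can be sketched in a few lines of illustrative Python. The names below, including synthesis_model and the random stand-in predictions, are hypothetical placeholders rather than Stability AI's API; the point is only where pixel information comes from in each case.

```python
import numpy as np

# Placeholder input: one 64x64 RGB frame from the guidance video.
input_frame = np.random.rand(64, 64, 3)

# --- Generative infill/outfill (for contrast) ---
# Known pixels are copied into the output explicitly; the network only
# predicts the missing region indicated by the mask.
mask = np.zeros((64, 64), dtype=bool)
mask[:, 32:] = True                                    # right half is "missing"
infill_output = np.where(mask[..., None],
                         np.random.rand(64, 64, 3),    # stand-in for the network's prediction
                         input_frame)                  # known pixels transferred directly

# --- Synthesis from scratch with guidance (as Jampani describes it) ---
# Every pixel of every novel-view frame is generated by the network; the
# input only steers the process as conditioning, so no pixels are copied.
def synthesis_model(noise, condition):
    # Hypothetical stand-in for a conditional diffusion network.
    return np.random.rand(*noise.shape)

novel_views = synthesis_model(noise=np.random.rand(8, 64, 64, 3),
                              condition=input_frame)
print(novel_views.shape)  # (8, 64, 64, 3): one frame per novel camera view
```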
Stable Video 4D is currently available for research evaluation on Hugging Face. Stability AI has not yet announced its commercial plans for the model.
“Stable Video 4D can already process single-object videos of several seconds with a plain background,” Jampani said. “We plan to expand its capabilities to longer videos and also to more complex scenes.”