On September 2, Hunyuan World Voyager, the latest member of the Hunyuan 3D world model series, was officially released. It is also the industry's first ultra-long-range roaming world model that supports native 3D reconstruction.
The model extends AI into the field of spatial intelligence, providing high-fidelity 3D scene roaming capabilities for virtual reality, physical simulation, game development, and other fields.
Hunyuan Voyager breaks through the spatial-consistency and exploration-range limits of traditional video generation, producing long-range, world-consistent roaming scenes, and supports exporting generated videos directly to 3D formats.
Voyager's 3D input and output are highly compatible with the previously open-sourced Hunyuan World Model 1.0: it further extends the 1.0 model's roaming range, improves generation quality in complex scenes, and enables stylized control and editing of the generated scenes.
Beyond that, Hunyuan Voyager supports a variety of 3D understanding and generation applications, such as video scene reconstruction, 3D object texture generation, customized video style generation, and video depth estimation, demonstrating the potential of spatial intelligence.
Interactive video models have shown potential for world generation. In practical applications such as virtual reality and physical simulation, however, explicit 3D scenes that can be modeled are often required, and purely video-generated content struggles to offer users realistic forms of interaction.
Generating world scenes directly in 3D, on the other hand, offers better spatial consistency and extensibility for interactive applications, but is limited by the scarcity of 3D training data and the memory inefficiency of 3D representations, and cannot generalize to more categories and larger scenes.
The Hunyuan Voyager framework introduces scene depth prediction into the video generation process, integrating the strengths of video generation and 3D modeling. Built on camera-controllable video generation, it synthesizes RGB-D videos (point-cloud videos containing RGB images plus depth information) from an initial scene view and a user-specified camera trajectory, giving free control over the viewing angle while preserving spatial continuity.
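To make the inputs and outputs concrete, the following Python sketch shows what such an interface could look like. All names here (generate_rgbd, pipeline.denoise) are hypothetical illustrations, not the released Voyager API.

```python
# Hypothetical sketch of an RGB-D video generation interface:
# one starting view plus a camera path in, color and depth frames out.
import numpy as np

def generate_rgbd(initial_view: np.ndarray,
                  camera_trajectory: list[np.ndarray],
                  pipeline) -> tuple[np.ndarray, np.ndarray]:
    """initial_view:      (H, W, 3) RGB image of the starting scene.
    camera_trajectory:    list of 4x4 camera pose matrices, one per frame.
    Returns (rgb, depth): (T, H, W, 3) color frames and (T, H, W)
    per-pixel depth, which together form the point-cloud video.
    """
    rgb_frames, depth_frames = [], []
    for pose in camera_trajectory:
        # The generator is conditioned on the start view and the requested
        # camera pose, and predicts color and depth jointly (assumption;
        # `denoise` is a placeholder, not a documented method).
        rgb, depth = pipeline.denoise(initial_view, pose)
        rgb_frames.append(rgb)
        depth_frames.append(depth)
    return np.stack(rgb_frames), np.stack(depth_frames)
```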
Users can drive the generated video with keyboard or joystick control, while 3D spatial memory keeps the imagery highly consistent, matching the functionality of interactive video models such as Genie 3. At the same time, Voyager supports lossless export of generated videos to 3D point clouds, with no reliance on additional reconstruction tools such as COLMAP.
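The "lossless export" follows from standard pinhole-camera geometry: with known intrinsics and pose, every RGB-D pixel back-projects to a colored 3D point without any COLMAP-style reconstruction. A minimal sketch (variable names are illustrative, not from the Voyager codebase):

```python
# Back-project one RGB-D frame into a world-space point cloud.
import numpy as np

def rgbd_to_pointcloud(rgb, depth, K, cam_to_world):
    """rgb: (H, W, 3), depth: (H, W), K: (3, 3) camera intrinsics,
    cam_to_world: (4, 4) camera pose. Returns (N, 6) xyz+rgb points."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Rays through each pixel via the inverse intrinsics, scaled by depth.
    rays = np.linalg.inv(K) @ np.stack([u, v, np.ones_like(u)], 0).reshape(3, -1)
    pts_cam = rays * depth.reshape(1, -1)
    # Transform points from camera space into world space.
    pts_world = (cam_to_world[:3, :3] @ pts_cam + cam_to_world[:3, 3:4]).T
    return np.concatenate([pts_world, rgb.reshape(-1, 3)], axis=1)
```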
Hunyuan Voyager is the first to support native 3D memory and scene reconstruction through the combination of spatial and feature information, avoiding the latency and accuracy loss of traditional post-processing. Adding 3D conditions to the input ensures accurate viewpoints, while the output is produced directly as 3D point clouds, suitable for a wide range of applications. The additional depth information also supports functions such as video scene reconstruction, 3D object texture generation, stylized editing, and depth estimation.
Voyager also introduces an extensible world-caching mechanism. Starting from an initial 3D point-cloud cache generated by the 1.0 model, the cache is projected onto the target camera view to provide guidance for the diffusion model; newly generated video frames then update the cache in real time. This forms a closed-loop system that supports arbitrary camera trajectories while maintaining geometric consistency, which not only expands the roaming range but also feeds new-view content back into the 1.0 model's scenes, improving overall generation quality.
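In schematic form, the closed loop works roughly as described above: render the cache into the next view, generate the frame from that guidance, then fuse the new observation back into the cache. The sketch below assumes this reading; all class and method names are hypothetical placeholders.

```python
# Schematic of the closed-loop world cache: project -> generate -> fuse.
def roam(world_cache, trajectory, model):
    frames = []
    for pose in trajectory:
        # 1. Render the accumulated point cloud from the target view;
        #    this partial image guides the diffusion model.
        guidance = world_cache.project(pose)
        # 2. Generate the full RGB-D frame conditioned on the guidance,
        #    filling in regions the cache does not yet cover.
        rgb, depth = model.generate(guidance, pose)
        # 3. Back-project the new frame and merge it into the cache,
        #    extending the 3D memory for subsequent views.
        world_cache.fuse(rgb, depth, pose)
        frames.append(rgb)
    return frames
```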
The model ranks first in overall capability on WorldScore, a world-model benchmark released by Fei-Fei Li's team at Stanford University, surpassing existing open-source methods and achieving strong results on both video generation and video-to-3D reconstruction tasks.
Tencent's open-sourcing of the Hunyuan world model series keeps accelerating. In July, Hunyuan 3D World Model 1.0 was released and open-sourced, becoming the industry's first roaming world generation model compatible with traditional CG pipelines. In August, the 1.0 Lite version followed, cutting video-memory requirements and enabling deployment on consumer-grade graphics cards. Just two weeks later, to address the limitations of occluded views and exploration range, the Hunyuan team launched the ultra-long-range roaming world model Voyager.
Hunyuan has previously open-sourced industry-leading capabilities in text, video, and 3D generation, providing open-source models whose performance approaches that of commercial models. Hunyuan's 3D series of open-source models ranks first in downloads in the open-source community.
Among foundation models, Hunyuan has open-sourced representative MoE-architecture models such as Hunyuan-Large and Hunyuan-A13B, as well as multiple small models for on-device scenarios, with as few as 0.5B parameters. Its latest open-source translation model, Hunyuan-MT-7B, took first place in 30 of 31 language categories at an international translation competition.