Two Stanford PhDs take an alternative approach to world models, backed by investors including Nvidia

DeepTech 26 Feb 2026 18:30

Recently, Moonlake AI, a San Francisco startup, announced the public beta of its "World Modeling Agent". The accompanying technical blog post walked through the ten-stage construction of a bowling mini-game, from mesh asset generation, rigid-body physics assignment, collision detection, scoring logic, and audio integration to inverse-kinematics (IK) grasping animation, all completed autonomously by the AI. Moonlake's beta product can turn a natural-language description into a complete game prototype that is runnable, controllable, and has physical feedback within 15 to 20 minutes.

Moonlake aims to let anyone generate a complete interactive world, with physics engines, game logic, collision detection, scoring systems, and even spatial audio, directly from natural language. In the founders' own words, they are not just building a game generator; they want to use the process of game development to train a frontier AI model on how the world works.

The company is headquartered in San Francisco and was co-founded by two PhDs from the Stanford AI Lab, Fan-Yun Sun and Sharon Lee. During his PhD at Stanford, Sun worked in both Nvidia's Learning and Perception research group and its Metropolis deep-learning team (related to the Omniverse projects), focusing on training AI agents to generate large-scale 3D worlds.

Lee's research combines diffusion models with 3D engines to build foundation models that can understand space. Their research backgrounds are highly complementary: one addresses "how to generate the world", the other "how to make the world interactive".

Figure | Fan-Yun Sun (right) and Sharon Lee (left) (Source: Moonlake AI)

The company exited stealth mode in October 2025 and announced the completion of a $28 million seed round led by AIX Ventures, Threshold Ventures, and Nvidia's venture arm NVentures.

The angel lineup is also impressive: YouTube co-founder Steve Chen, AngelList founder Naval Ravikant, Google Chief Scientist Jeff Dean, GAN (Generative Adversarial Network) inventor Ian Goodfellow, and several executives from Hugging Face, DeepMind, Stability AI, and OpenAI. The financing figure has since been updated to approximately $30 million, and the team of around 15 people includes ACM ICPC medalists and international olympiad winners.

In December 2025, Moonlake released its core product Reverie, also called GGE (Generative Game Engine). According to the company, this is the first "programmable world model" for real-time interactive content generation. Its key difference from earlier AI video-generation models is state persistence.

Most video-generation models (such as Sora) can produce beautiful visuals, but they are essentially predicting what the next frame should look like, without maintaining a true world state. If a player breaks a vase in the game, it may snap back to its intact form a few seconds later.
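The contrast can be made concrete with a minimal sketch (illustrative only, not Moonlake's actual code): a world that stores explicit state derives every frame from that state, so a broken vase cannot "heal" between frames the way it can in a pure next-frame predictor.

```python
class StatefulWorld:
    """A minimal world model that remembers events instead of re-predicting them."""

    def __init__(self):
        # Explicit, persistent state: once the vase breaks, it stays broken.
        self.objects = {"vase": {"intact": True}}

    def apply_event(self, obj, event):
        if event == "break" and obj in self.objects:
            self.objects[obj]["intact"] = False

    def render_frame(self):
        # Every frame is derived from the same underlying state,
        # so later frames cannot contradict earlier events.
        return {name: ("intact" if o["intact"] else "broken")
                for name, o in self.objects.items()}


world = StatefulWorld()
world.apply_event("vase", "break")
frame_now = world.render_frame()     # {"vase": "broken"}
frame_later = world.render_frame()   # still {"vase": "broken"}
```

A frame predictor, by comparison, conditions only on recent pixels, so nothing forces frame N+100 to remember what happened at frame N.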

Moonlake's system couples diffusion models with structured 3D signals so that changes in the world persist. Lee has said in an interview that the missing piece of the generative-world puzzle is "control": creators need to be able to define what changes, why it changes, and how long the change lasts.

Specifically, Moonlake's technical architecture is not a single model but an "orchestrator". Once a user's natural-language instruction comes in, the system calls a set of specialized third-party AI models to handle different tasks: spatial layout uses multimodal reasoning, game logic relies on program synthesis, physical interaction runs through a simulation layer, and visual rendering is handled by real-time diffusion models.
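The orchestrator pattern described above can be sketched as a simple dispatch table. This is a hypothetical illustration; the task names and handlers are invented for the example and are not Moonlake's actual API.

```python
# Stand-ins for the specialized models; each returns a placeholder result.
def layout_model(task):   return f"layout for: {task}"
def logic_model(task):    return f"program for: {task}"
def physics_model(task):  return f"simulation for: {task}"
def render_model(task):   return f"frames for: {task}"

# One specialist per modality, mirroring the division of labor in the text.
SPECIALISTS = {
    "spatial_layout": layout_model,   # multimodal reasoning
    "game_logic":     logic_model,    # program synthesis
    "physics":        physics_model,  # simulation layer
    "rendering":      render_model,   # real-time diffusion
}

def orchestrate(subtasks):
    # Route each (kind, description) pair to the matching specialist;
    # the orchestrator's job is only to integrate the results.
    return {kind: SPECIALISTS[kind](desc) for kind, desc in subtasks}

result = orchestrate([
    ("spatial_layout", "bowling lane in a cyberpunk cafe"),
    ("game_logic", "scoring rules"),
])
```

The interesting claim in the article is the next step: over time the orchestrator itself learns to absorb what the specialists do, rather than remaining a fixed router like this one.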

Sun told Fast Company that their orchestration model will learn to integrate these modalities over time, gradually internalizing the capabilities of the external models.

Moonlake's official blog offers a very concrete case to demonstrate the system's reasoning process: a cyberpunk-style bowling mini-game. The user gave only a one-sentence prompt, "Create a cyberpunk-aesthetic, semi-realistic bowling mini-game in a street-side internet café," with no architectural constraints or implementation details.

The system's agent then completes ten stages automatically. First comes asset instantiation, generating 3D meshes and PBR (Physically Based Rendering) textures for the lanes, pins, and bowling ball. Next is physicalization, converting the pins and ball into rigid bodies with a friction coefficient of 0.4 and a restitution of 0.15, at 1.5 kilograms per pin and 5 kilograms for the ball. The remaining stages cover spatial layout, core game logic, ball lifecycle management, boundary stability, edge-case handling, audio integration, an IK (Inverse Kinematics) arm-grasping system, and finally detail polishing driven by user feedback.
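The physicalization stage boils down to attaching parameters like these to each body. The numbers below come from the blog example; the config structure is illustrative, and assuming that ball and pins share the same friction and restitution values is my assumption, not something the post states.

```python
# Illustrative rigid-body parameters for the bowling example.
PIN = {
    "body_type": "rigid",
    "mass_kg": 1.5,
    "friction": 0.4,       # friction coefficient from the post
    "restitution": 0.15,   # "elasticity" in the post
}

BALL = {
    "body_type": "rigid",
    "mass_kg": 5.0,
    "friction": 0.4,       # assumed same material as the pins
    "restitution": 0.15,
}

# At equal speed, momentum scales with mass: a 5 kg ball carries over 3x
# the momentum of a 1.5 kg pin, which is why one strike scatters the rack.
momentum_ratio = BALL["mass_kg"] / PIN["mass_kg"]  # = 5.0 / 1.5
```

In a real engine these values would feed a physics solver's material and body definitions rather than sit in plain dictionaries.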

This example also shows that Moonlake's definition of "world model" differs significantly from the mainstream discourse in today's AI community. Over the past year the term "world model" has been used widely in the industry, but most of the time it refers to next-frame video prediction: given the current frame and the user's actions, predict what the next visual should look like.

Google DeepMind's Genie 3, released in August 2025, can generate navigable 3D environments at 24 frames per second; Fei-Fei Li's World Labs launched Marble in November 2025, which can generate downloadable 3D worlds from text, images, or videos.

Moonlake's approach is quite different. In their view, the state of a world cannot be reduced to a frame of imagery or a cluster of pixels.

Their blog post uses a bowling pin as an example: a pin is simultaneously a textured object in space, a rigid body with mass and inertia, an object that can be knocked down, a symbolic entity that contributes to the score, and a sound source at the moment of impact. When the ball hits the pins, the transformation matrices update, the physics solver resolves the collision impulse, the score increases, the audio triggers, and the reset timer advances. These are not independent events but synchronous results of the same causal event. If any one of these modalities updates while the others fail to keep up, the world falls apart.

So Moonlake pursues cross-modal causal consistency rather than mere visual realism. They divide what a world model should encode simultaneously into five dimensions: geometry (transforms, topology, spatial relationships), physics (mass, forces, collision constraints), affordances (what actions are possible and who performs them), symbolic logic (rules, scores, timers, state machines), and perceptual mapping (visual projection and spatial audio). This framework is more comprehensive than purely visual world models and closer to what traditional game engines actually do.

(Source: Moonlake AI)

Judging from the current product experience, it can indeed produce a simple game prototype quickly, but polishing still takes significant effort. In Fast Company's test, the reporter's first attempt, a 3D dungeon-adventure game, failed, yielding a single room filled with capsule-shaped characters.

He then narrowed the scope to a 2D ice-cream-stacking game, and a first version was ready within 15 to 20 minutes. The core gameplay was basically in place, the rhythm of ice cream falling from the sky felt right, the keyboard controls were mapped automatically, and the system even added a bounce animation when a scoop landed on the cone. But the chef was a rough white figure, and the ice cream wouldn't stack correctly.

He then spent several hours going back and forth with the AI to fix the physics, stuck in a loop of "almost solved but not quite". In the end, he dumped the remaining requirements into the system and received the complete game, with scoring and a game-over screen, 15 minutes later, having consumed about 950 of his 1,500 monthly credits, roughly $25 worth on the $40-a-month plan. The speed is remarkable, but the polishing is still laborious.

However, Moonlake's real long-term bet is not at the tool level. Lee and Sun repeatedly emphasize that every time users correct the system's physical behavior, supplement game rules, or adjust causal relationships on the platform, they provide training signals for Moonlake's own multimodal model.

Sun contrasted this with existing ways of collecting world data: renting Airbnbs and laser-scanning the rooms is static and hard to scale; parsing videos lacks human context; and a model trained on data from a single game (say, a large volume of Fortnite footage) will not generalize to the real world.

User interactions on Moonlake naturally carry intention and feedback, making them causal data. If this flywheel spins up, the data will grow exponentially and the model will strengthen accordingly. Beyond games, their envisioned applications include robot training, autonomous driving, and human-factors analysis in manufacturing. Lee said they have already received inquiries from manufacturing companies.

For now, though, the beta admits only about 100 new users per day, and there is still considerable distance to cover before the flywheel starts spinning.
