AI工具情报站 (AI Tools Intelligence Station)

Teaching Agents to "Predict First, Then Act"

WeChat official account: 通义实验室 (Tongyi Lab, Qwen), 2026-06-24

Alibaba's Qwen Team Built a "World Model" That Lets Agents "Simulate" the Future in Their Head

Training Agents with LLMs has a persistent, thorny problem: letting Agents try things out in real environments is expensive and sometimes downright dangerous.

For example, say you want to train an Agent that can operate a computer. If you let it freely experiment in a real environment, it might accidentally delete files, send emails to the wrong person, or crash the system. You can't exactly give it admin privileges and let it run wild in production, right?

Alibaba's Qwen team recently released Qwen-AgentWorld, which proposes a clever solution: first train a "World Model," then let the Agent "think through" its actions in a simulated environment before touching the real world, rather than learning by blind trial and error in production.

What Is a "Language World Model"?

The concept of "World Model" originated in reinforcement learning. The core idea: if AI can learn to "predict the next state," it can first simulate "what will happen if I do this" before taking action.

For example, you ask an Agent to book a flight for you. A traditional Agent's approach: go directly to the airline website, pick dates, pick flights, fill in passenger info, pay—if anything goes wrong at any step, it might have to start over, or worse, make an erroneous operation.

If there's a "World Model," the Agent can first "simulate" the entire booking process in its head—"if I click this button, which page will it go to? If I fill in this info, how will the system respond?"—and only execute in the real environment after confirming everything looks right.
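The "simulate first, execute later" idea can be sketched in a few lines. This is a minimal toy, not the Qwen team's implementation: the `TRANSITIONS` table, the page names, and the functions `predict_next_state`, `simulate`, and `should_execute` are all hypothetical, and a lookup table stands in for what would really be a learned model that predicts the next state.

```python
# Hypothetical toy "world model": a lookup table from (state, action) to the
# predicted next state, all expressed as text. A real world model would be a
# trained model generating the next-state description.
TRANSITIONS = {
    ("search page", "pick dates"): "flight list",
    ("flight list", "pick flight"): "passenger form",
    ("passenger form", "fill passenger info"): "payment page",
    ("payment page", "pay"): "booking confirmed",
}


def predict_next_state(state: str, action: str) -> str:
    """Predict the next textual state; unknown transitions map to an error state."""
    return TRANSITIONS.get((state, action), "error: unexpected page")


def simulate(start: str, plan: list[str]) -> list[str]:
    """Roll the plan forward inside the world model without touching the real site."""
    states = [start]
    for action in plan:
        states.append(predict_next_state(states[-1], action))
        if states[-1].startswith("error"):
            break  # stop simulating once the predicted rollout goes wrong
    return states


def should_execute(start: str, plan: list[str], goal: str) -> bool:
    """Approve real execution only when the simulated rollout reaches the goal."""
    return simulate(start, plan)[-1] == goal


good_plan = ["pick dates", "pick flight", "fill passenger info", "pay"]
bad_plan = ["pick dates", "pay"]  # skips steps, so the model predicts an error

print(should_execute("search page", good_plan, "booking confirmed"))
print(should_execute("search page", bad_plan, "booking confirmed"))
```

The point of the design is that the faulty plan is rejected during simulation, before any real payment page is ever touched.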

Alibaba's Qwen-AgentWorld does exactly this, except that instead of visual simulation (such as generating the next video frame) it uses "language" to model the world: environment states and changes are all described in text.

This choice is quite pragmatic: many Agent tasks (operating a computer, searching for information, calling APIs) are essentially "text in, text out," so modeling the world in language is sufficient for them.

Built on 10 Million Real Interaction Trajectories

Qwen-AgentWorld is trained on over 10 million real environment interaction trajectories. Data sources include: MCP (Model Context Protocol) interactions, search behavior, terminal operations, software engineering tasks, Web browsing, OS operations, Android operations—seven domains in total.

Training happens in three stages:

1. **CPT (Continued Pre-Training)**: "feed" interaction trajectories to the model, letting it learn "how does the environment work"

2. **SFT (Supervised Fine-Tuning)**: teach the model "given the current state, what should the next state be"

3. **RL (Reinforcement Learning)**: let the model learn by trial and error in a simulated environment, adjusting its strategy based on whether the result is good or bad

This CPT→SFT→RL three-stage training is a technical innovation from the Qwen team. They found that if "environment modeling" is made a training objective during pre-training (rather than retrofitting it after general LLM training is done), the model's world modeling capability is significantly stronger.
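To make the SFT objective ("given the current state, what should the next state be") concrete, here is a minimal sketch of how one recorded trajectory could be turned into next-state prediction examples. The trajectory layout, the prompt template, and the function `to_next_state_examples` are assumptions made for illustration; the source does not describe the actual data format.

```python
import json

# Hypothetical trajectory format: a list of textual states, each paired with
# the action taken from that state. The real data format is not described.
trajectory = {
    "domain": "terminal",
    "steps": [
        {"state": "$ ls\nnotes.txt", "action": "rm notes.txt"},
        {"state": "$ rm notes.txt\n$", "action": "ls"},
        {"state": "$ ls\n$", "action": None},  # final state, no further action
    ],
}


def to_next_state_examples(traj: dict) -> list[dict]:
    """Turn one trajectory into (state, action) -> next_state training pairs."""
    steps = traj["steps"]
    examples = []
    # Pair every step with the step that follows it.
    for current, following in zip(steps, steps[1:]):
        prompt = (
            f"Domain: {traj['domain']}\n"
            f"Current state:\n{current['state']}\n"
            f"Action: {current['action']}\n"
            "Next state:"
        )
        examples.append({"prompt": prompt, "target": following["state"]})
    return examples


examples = to_next_state_examples(trajectory)
for ex in examples:
    print(json.dumps(ex, ensure_ascii=False))
```

A trajectory with three recorded states yields two training pairs, since the last state has no successor to predict.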

Evaluation Results: Beat GPT-5.4 and Claude Opus 4.8

On the AgentWorldBench evaluation benchmark, Qwen-AgentWorld-397B-A17B (the largest version) scored 58.71, surpassing GPT-5.4's 58.25 and also beating Claude Opus 4.8.

The gap looks small at 0.46 points, but in Agent evaluation a difference of about half a point can mean noticeably fewer mistakes in real-world scenarios.
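The size of the gap can be checked directly from the two scores quoted above. The snippet below only does that arithmetic; the variable names are illustrative, and no score for Claude Opus 4.8 is included because the text does not give one.

```python
qwen_score = 58.71  # Qwen-AgentWorld-397B-A17B on AgentWorldBench, from the text
gpt_score = 58.25   # GPT-5.4 on AgentWorldBench, from the text

# Absolute gap in benchmark points, rounded to hide floating-point noise.
gap = round(qwen_score - gpt_score, 2)

# The same gap expressed as a percentage of the GPT-5.4 score.
relative_pct = round(100 * gap / gpt_score, 2)

print(f"absolute gap: {gap} points")
print(f"relative gap: {relative_pct}%")
```

In other words, the lead is under half a point in absolute terms and less than one percent in relative terms.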

More interesting is the small model's performance. After three-stage training, Qwen-AgentWorld-35B-A3B (a relatively small version) improved its overall average score by 8.66 points. The size of this gain suggests that world modeling capability is "transferable": you don't need to retrain for every task, because a general world model can cover many scenarios.

Two Application Paradigms

The Qwen team also explored two ways to apply world models in Agent training:

**First: as a decoupled environment simulator.** Agent training can happen in the world model's "simulated environment" instead of a real one. This has two benefits: safety (nothing in a real system can be broken) and efficiency (many Agents can run in parallel in a simulated environment, which is hard to do safely in a real one). The team ran experiments on the WideSearch task: Agents trained in a simulated environment (F1 50.3%) outperformed Agents trained in a real environment (F1 45.6%).
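The efficiency argument can be illustrated with a small sketch of several Agents rolling out episodes side by side against a simulator. Everything here is a stand-in: `simulated_step` is a hypothetical placeholder for a world-model call, and `rollout` and the action names are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def simulated_step(state: str, action: str) -> str:
    """Hypothetical stand-in for a world-model call: returns the next state as text."""
    return f"{state} -> {action}"


def rollout(agent_id: int, actions: list[str]) -> str:
    """Run one agent's episode entirely inside the simulator."""
    state = f"agent{agent_id}:start"
    for action in actions:
        state = simulated_step(state, action)
    return state


plan = ["search", "open result", "extract answer"]

# Each rollout owns its own state, so episodes cannot interfere with each
# other or with any real system; that is what makes running them in parallel safe.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda i: rollout(i, plan), range(8)))

for line in results:
    print(line)
```

Because the simulator holds no shared real-world state, scaling up is a matter of adding workers rather than provisioning more sandboxed machines.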

**Second: as an Agent foundation model.** You can first use the world model to give an Agent "warm-up training" (letting it learn "how does the environment work"), then fine-tune it on specific tasks. The team found that after warm-up with the language world model (LWM), Agent performance improved on all seven evaluation benchmarks, and three of those benchmarks were "completely unseen during training." This suggests the world model has strong generalization capability.

Open-Sourced and Free to Use

The Qwen-AgentWorld model and AgentWorldBench evaluation benchmark are both open-sourced on Hugging Face and ModelScope, free to use.

For teams doing Agent R&D, this open-source release is quite valuable. Previously, training an Agent by trial and error in real environments was expensive; with this world model, you can train in a simulated environment first and deploy to a real environment only once training is reasonably complete.

The impact on the entire Agent field might be: lowering the barrier and cost of Agent training, enabling more small teams to train reliable Agents.

Source: WeChat official account 通义实验室 (Tongyi Lab, Qwen)
