Doubao Audio Generation Model 1.0 Released, Redefining AI Audio Creation

Doubao Released an Audio Generation Model That Lets AI Be a "Sound Director"

When it comes to audio generation, most people are familiar with TTS (text-to-speech): you enter some text, and the AI reads it aloud.

But Seed-Audio 1.0 (Doubao Audio Generation Model 1.0), recently released by Doubao, isn't simple TTS; it's "audio creation." You give it a text description, and it generates a complete audio piece containing multi-character dialogue, background music, and ambient sound effects.

The difference is a bit like "asking AI to read a text for you" vs. "asking AI to direct a radio drama for you."

One Prompt, Orchestrating the Entire Audio Scene

The most powerful aspect of Seed-Audio 1.0 is that a single text prompt can control several dimensions at once:

**Multi-character dialogue**: multiple "speakers" in one audio piece, each with their own voice timbre
**Emotion and tone**: is the speaker happy, angry, or hesitant?
**Background music**: what BGM to pair with it
**Environmental atmosphere**: the buzz of a café, outdoor wind sounds, indoor reverb...
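
To make the four dimensions above concrete, here is a minimal sketch of how a creator might organize them before flattening everything into one text prompt. The article does not document Seed-Audio's actual prompt format, so the `ScenePrompt` and `Line` classes, their field names, and the rendered wording are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Line:
    speaker: str  # character name
    emotion: str  # e.g. "hesitant", "happy"
    text: str     # what the character says


@dataclass
class ScenePrompt:
    """Hypothetical container for the four dimensions; not an official schema."""

    ambience: str
    bgm: str
    voices: dict[str, str] = field(default_factory=dict)  # speaker -> timbre description
    lines: list[Line] = field(default_factory=list)

    def render(self) -> str:
        """Flatten all dimensions into a single free-text prompt."""
        parts = [f"Setting: {self.ambience}.", f"Background music: {self.bgm}."]
        for name, timbre in self.voices.items():
            parts.append(f"{name} has a {timbre} voice.")
        for line in self.lines:
            parts.append(f'{line.speaker} ({line.emotion}): "{line.text}"')
        return "\n".join(parts)


scene = ScenePrompt(
    ambience="a busy café with low chatter",
    bgm="soft jazz piano",
    voices={"Lin": "warm, low female", "Chen": "bright, young male"},
    lines=[
        Line("Lin", "hesitant", "I was not sure you would come."),
        Line("Chen", "happy", "I would not have missed it."),
    ],
)
print(scene.render())
```

Keeping the dimensions in separate fields makes it easy to change one of them (say, the BGM) and re-render without touching the dialogue.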

In the past, you had to generate each character's dialogue separately and then mix it in audio editing software: placing dialogue, BGM, and ambient sounds on separate tracks, adjusting volume and timing, and finally exporting.

Seed-Audio 1.0 generates "end-to-end": you supply the prompt, and it directly outputs the finished audio, with no post-mixing required.

"One Voice, Multiple Roles" and Long-Form Consistency

There are two technical challenges here that Doubao claims to have solved.

First is "one voice, multiple roles": within the same audio piece, multiple characters use different timbres, yet the model keeps the whole thing coherent as a single story scene. This is hard because the model has to understand the narrative context, knowing whose turn it is to speak and what tone each line should have.

Second is "long-form consistency": when generating longer audio (say, a few minutes of radio drama), a character's timbre must not drift, sounding like one person at the beginning and like someone else by the end. Seed-Audio 1.0 supports generating up to 2 minutes of audio at once; if more is needed, you can run multiple "extensions" while keeping the timbre consistent.
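
As a rough illustration of the 2-minute limit and the extension mechanism, the sketch below plans how a longer piece would be split into an initial generation plus extensions. It assumes each extension is also capped at 2 minutes, which the article does not state; `plan_segments` is a hypothetical helper, not part of any Doubao API.

```python
from __future__ import annotations

MAX_SEGMENT_SECONDS = 120  # the 2-minute per-generation limit cited above


def plan_segments(
    total_seconds: int, max_segment: int = MAX_SEGMENT_SECONDS
) -> list[tuple[int, int]]:
    """Split a target duration into (start, end) windows of at most max_segment seconds.

    The first window is the initial generation; each later window is one extension.
    Assumes extensions share the same cap, which the article does not confirm.
    """
    if total_seconds <= 0 or max_segment <= 0:
        raise ValueError("durations must be positive")
    segments = []
    start = 0
    while start < total_seconds:
        end = min(start + max_segment, total_seconds)
        segments.append((start, end))
        start = end
    return segments


plan = plan_segments(300)  # a five-minute radio drama
print(plan)                # [(0, 120), (120, 240), (240, 300)]
print(len(plan) - 1)       # number of extensions needed: 2
```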

Zero-Shot Multimodal Input

Another practical feature is "zero-shot multimodal input": without any fine-tuning, you can give the model images, audio, and text, and it will generate audio based on those inputs.

For example, you upload a photo (say, a seaside sunset) along with the text description "this audio should feel peaceful and warm," and the model generates ambient sound effects and BGM to match.

Or you give the model a reference audio clip (say, a recording of someone speaking in a tearful tone) and ask it to generate new dialogue audio with the same emotional baseline.

Timbre and Style Decoupling

"Timbre and style decoupling control" is an important technical point. Meaning: timbre (whose voice this is) and style (how this voice speaks—what emotion, what speaking pace) are controlled separately.

Why decouple? Because in real use cases, you often need to "swap timbre without changing emotion" or "swap emotion without changing timbre." If timbre and style are bundled together, flexible combination becomes impossible.

Doubao says Seed-Audio 1.0 achieves this decoupling, giving users more flexible control over generation results.
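
The value of decoupling can be shown with a small data-structure sketch: if timbre and style live in separate fields, either one can be swapped while the other stays fixed, and any timbre can be paired with any style. This only models the idea. The class names and fields are hypothetical and say nothing about how Seed-Audio represents timbre or style internally.

```python
from dataclasses import dataclass, replace
from itertools import product


@dataclass(frozen=True)
class Timbre:
    name: str  # whose voice this is


@dataclass(frozen=True)
class Style:
    emotion: str  # how the voice speaks
    pace: str


@dataclass(frozen=True)
class VoiceSpec:
    timbre: Timbre
    style: Style


base = VoiceSpec(Timbre("narrator_a"), Style("sad", "slow"))

# Swap timbre without changing emotion.
swapped_voice = replace(base, timbre=Timbre("narrator_b"))

# Swap emotion without changing timbre.
swapped_mood = replace(base, style=Style("happy", "fast"))

# Because the two axes are independent, every pairing is available.
timbres = [Timbre("narrator_a"), Timbre("narrator_b")]
styles = [Style("sad", "slow"), Style("happy", "fast"), Style("angry", "fast")]
combos = [VoiceSpec(t, s) for t, s in product(timbres, styles)]
print(len(combos))  # 6
```

If timbre and style were a single bundled preset, covering the same six combinations would require six separate presets instead of two timbres plus three styles.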

Where Can It Be Used?

This capability might seem "content-creation-specific," but the application scenarios are actually quite broad:

**Short video voiceover**: creating short videos for Douyin, Kuaishou, and Xiaohongshu requires a large volume of voiceover work. Seed-Audio lets creators generate multi-character dialogue, add BGM, and layer in ambient sound effects in one pass.
**Audiobooks/radio dramas**: this is a natural fit. Making audiobooks used to require hiring voice actors and doing post-production; now AI can generate the multi-character dialogue, dramatically improving efficiency.
**Educational content**: language learning and children's stories both need engaging audio content.
**Game and film pre-visualization**: during early-stage game development or film production, AI can quickly generate draft voiceovers for teams to evaluate.

Invite-Only Testing Already Open, Coming to Jianying Soon

Seed-Audio 1.0 is already in invite-only testing through the Volcano Ark API. Individual users get a free creation quota of 30 minutes.

Doubao also said this feature is coming soon to ByteDance products like Jianying (video editing app), Jimeng (AI image generation), and Fanqie (novel reading). This means ordinary users will soon be able to access this capability directly within commonly used creation tools.

For content creators, this tool opens up considerable possibilities. Especially in scenarios that require batch audio production (say, accounts that post several voiced short videos daily), AI audio generation can significantly reduce production cost and time.

Of course, whether the final generation quality meets professional requirements remains to be seen through actual use.