Interaction: Moving Large Models from "Question-Answer" to "Watch-While-Talking"

JD.com Open-Sourced Something Interesting: An AI That "Watches" and "Talks" Simultaneously

Today, LLMs mostly handle video like this: you upload a video, and the model gives you a summary once it has watched the whole thing. What's the problem with this approach? The model "talks after watching," not "talks while watching."

JD.com's recently open-sourced JoyAI-VL-Interaction does something different: it lets the AI continuously observe a video stream, decide on its own whether something noteworthy is happening, and respond in real time.

This capability might not sound sexy, but the practical application space is actually quite large.

"Question-Answer" vs. "Watch-While-Talking"

Current video understanding models mostly use the "question-answer" paradigm:

User: Here's a surveillance video, help me check if there's anything abnormal
Model: (after watching the entire video) OK, found a person climbing over a wall at 3:12

The problem with this paradigm: the user has to wait for the model to finish watching the entire video before getting a result. If the video is very long (like a full day's surveillance footage), this wait time is painful.

JoyAI-VL-Interaction does "watch-while-talking": the model continuously evaluates while watching the video—"hey, something might be happening in this frame"—and proactively alerts the user. No need for the user to ask first; the model knows when to speak up.
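The difference between the two paradigms can be sketched as a streaming loop. This is a minimal illustration only, not JoyAI-VL-Interaction's actual implementation: `score_frame`, `threshold`, and `cooldown_s` are hypothetical names, and a real system would presumably let the model itself decide when to speak rather than apply a fixed threshold.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Optional, Tuple


@dataclass
class Alert:
    timestamp: float  # seconds since the start of the stream
    score: float      # hypothetical "noteworthiness" score in [0, 1]


def watch_while_talking(
    frames: Iterable[Tuple[float, object]],
    score_frame: Callable[[object], float],
    threshold: float = 0.8,
    cooldown_s: float = 5.0,
) -> Iterator[Alert]:
    """Yield an alert as soon as a frame looks noteworthy.

    Unlike the question-answer paradigm, nothing here waits for the
    stream to end: each frame is judged as it arrives.
    """
    last_alert_at: Optional[float] = None
    for timestamp, frame in frames:
        score = score_frame(frame)  # assumption: stands in for the model's judgment
        if score < threshold:
            continue  # nothing noteworthy, stay silent
        if last_alert_at is not None and timestamp - last_alert_at < cooldown_s:
            continue  # avoid repeating the same alert on every frame
        last_alert_at = timestamp
        yield Alert(timestamp, score)


# Toy stream: each "frame" is already a score, so score_frame is the identity.
toy_stream = [(0.0, 0.1), (1.0, 0.9), (2.0, 0.95), (10.0, 0.85)]
alerts = list(watch_while_talking(toy_stream, score_frame=lambda f: f))
print([a.timestamp for a in alerts])  # [1.0, 10.0]
```

Because the function is a generator, the caller receives the first alert at the 1-second mark instead of after the whole stream has been processed, which is the point of "watch-while-talking."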

Blind Test Win Rates: 77.6% vs. Doubao, 87.9% vs. Gemini

JD.com published some blind test data: across 58 blind evaluations by human raters, JoyAI-VL-Interaction had a win rate of 77.6% against Doubao's video call assistant and 87.9% against Gemini's video call assistant. In surveillance alert scenarios, the win rate reached 100%.

This data should be treated with caution: what exactly were the blind test criteria? How was the test set constructed? These details aren't clear yet. But even with those uncertainties, an 87.9% win rate against Gemini is an eye-catching number.
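As a rough sanity check on these figures, the sketch below recovers the win counts implied by the published percentages and attaches a 95% Wilson score interval to each. It assumes each of the 58 evaluations was a single head-to-head comparison with no ties, which the announcement does not confirm; the function names are illustrative.

```python
import math


def wins_from_rate(rate_percent: float, n_trials: int) -> int:
    """Nearest whole number of wins implied by a percentage win rate."""
    return round(rate_percent / 100 * n_trials)


def wilson_interval(wins: int, n_trials: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = wins / n_trials
    denom = 1 + z**2 / n_trials
    center = (p + z**2 / (2 * n_trials)) / denom
    half = z * math.sqrt(p * (1 - p) / n_trials + z**2 / (4 * n_trials**2)) / denom
    return center - half, center + half


N_TRIALS = 58  # from the article; one comparison per evaluation is an assumption
for opponent, rate in [("Doubao", 77.6), ("Gemini", 87.9)]:
    wins = wins_from_rate(rate, N_TRIALS)
    low, high = wilson_interval(wins, N_TRIALS)
    print(f"{opponent}: {wins}/{N_TRIALS} wins, 95% interval {low:.1%} to {high:.1%}")
```

Under these assumptions the percentages correspond to 45 and 51 wins out of 58, and the interval for the Gemini comparison runs from roughly 77% to 94%. With a sample this small, the headline number is best read as "clearly ahead" rather than as a precise rate.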

What Got Open-Sourced?

JD.com was quite thorough with this open-source release, including:

**Model weights**: can be downloaded and used directly
**Interaction dataset**: the dataset used to train the model is also open-sourced, so others can do further research on this basis
**Training recipe**: how the model was trained is also public
**Complete deployable system**: not just releasing model weights and calling it a day, but providing a system that can actually be deployed

Supported inputs include real-time camera feeds, live streams, and local video files. Voice interaction is supported, and the system also has long-term memory (meaning the model can remember what it told you before, so you don't have to repeat background info every time).

It also supports native deployment on vLLM-Omni. vLLM is a popular LLM inference and serving framework, and vLLM-Omni is its extension for omni-modal models. Native support means deployment should be relatively convenient.

What Scenarios Is It Suitable For?

JD.com mentioned a few scenarios when announcing the model:

**Security surveillance**: this is the easiest to understand. A shopping mall or campus monitoring center used to require humans to stare at dozens of screens. Now AI can help watch and proactively alert when something's wrong. Detection shifts from "a human happens to notice" to "AI alerts in real time," which could be a substantial efficiency gain.

**Elderly care**: safety monitoring for seniors living alone has long been a social concern. With a camera plus AI, you can determine in real time whether an elderly person has fallen, has been inactive for an extended period, or whether there are abnormal sounds, and notify family members immediately if something's wrong.
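To make the "inactive for an extended period" condition concrete, here is a minimal rule-based sketch. It is not how JoyAI-VL-Interaction works internally (the model presumably makes this judgment itself); it assumes some upstream component already produces timestamps at which motion was seen, and `find_inactive_gaps` and `max_idle_s` are hypothetical names.

```python
from typing import Iterable, List, Tuple


def find_inactive_gaps(
    motion_timestamps: Iterable[float],
    end_time: float,
    max_idle_s: float = 1800.0,
) -> List[Tuple[float, float]]:
    """Return (start, end) spans with no observed motion longer than max_idle_s.

    motion_timestamps: seconds at which motion was detected (any order).
    end_time: the current time, so a still-ongoing quiet period is also caught.
    """
    gaps: List[Tuple[float, float]] = []
    previous = None
    for t in sorted(motion_timestamps):
        if previous is not None and t - previous > max_idle_s:
            gaps.append((previous, t))
        previous = t
    # Check the stretch between the last observed motion and "now".
    if previous is not None and end_time - previous > max_idle_s:
        gaps.append((previous, end_time))
    return gaps


# Motion seen at 0 s, 600 s, 900 s and 4000 s; 30-minute idle limit.
print(find_inactive_gaps([0, 600, 900, 4000], end_time=4500))  # [(900, 4000)]
```

A fixed timer like this would fire on naps and quiet evenings as well, which is one reason a model that can look at the scene and judge context is attractive for this use case.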

**Live stream commentary**: this scenario is quite interesting. AI can watch a live stream and generate commentary automatically. In e-commerce live streaming, for example, it can introduce the products in the frame in real time; in sports live streaming, it can provide real-time match commentary.

Why Does This Matter?

I think the value of JD.com's work lies not in how technically impressive it is (although there are technical innovations), but in the fact that it defines a new paradigm for human-AI interaction.

In the past, our interaction with AI mostly followed the "question-answer" pattern. Whether text conversation, voice conversation, or video understanding, it's all "human speaks first, AI answers."

But JoyAI-VL-Interaction demonstrates an "AI-proactive" paradigm: the AI continuously observes the environment and decides when to intervene, when to alert, and when to speak. This paradigm is more like an "assistant" than a "question-answering machine."

I think this direction is quite right. For AI to truly integrate into daily life, it can't just wait for people to "use" it—it needs to proactively sense the environment and offer help.

Of course, there's still some distance from "technical demo" to "reliable product." But I'm bullish on this direction.