Meituan open-sources LongCat-Video-Avatar 1.5 digital human framework; inference reduced to 8 steps

ME AI News: According to Data Observation Beating Monitoring, Meituan's LongCat team has open-sourced LongCat-Video-Avatar 1.5, an audio-driven portrait video generation framework, releasing the full code and model weights. The upgrade replaces the Wav2Vec2 audio encoder with Whisper-large-v3 to improve lip-sync and lip-shape dynamics; the acoustic representations from Whisper-large-v3 are said to significantly improve the stability of multilingual and cross-lingual lip generation. The release also aims for stronger identity consistency in long videos and broader style generalization.

To improve temporal stability, the framework uses multi-segment rolling inference in long video generation to keep the character's identity coherent. On the inference side, a DMD2-based few-step distillation technique compresses denoising to 8 steps (8 NFE, i.e. number of function evaluations), balancing inference efficiency against image fidelity.

The model was evaluated on 508 image-audio source pairs. A crowdsourced evaluation collected 13,240 judgments from 770 raters, and 10 experts scored outputs on physical plausibility, coordination, temporal stability, and identity consistency. The official demonstration compares the framework with HeyGen, Kling Avatar 2.0, and OmniHuman-1.5, highlighting improvements in temporal stability, identity consistency, and natural lip movement. Beyond realistic portraits, the framework generalizes to anime and animal styles and natively supports mono and multi-channel audio input.

The model weights are released under the MIT license. However, the project's ethics statement notes that the generated content shown on the page is for academic use only and may not be used commercially. Actual commercial deployment still requires separately verifying the boundaries that apply to the weights, code, materials, and content. (Source: BlockBeats)
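The report does not describe how the multi-segment rolling inference is implemented, so the following is only a minimal sketch of the general idea: a long clip is split into overlapping windows, and each window after the first reuses the tail of the previous one as conditioning. The function names, the segment length of 81 frames, and the 16-frame overlap are all hypothetical, and charging 8 denoiser evaluations to every segment is an assumption layered on the 8 NFE figure above, not a detail confirmed by the project.

```python
from typing import List, Tuple


def plan_rolling_segments(
    total_frames: int, segment_len: int, overlap: int
) -> List[Tuple[int, int]]:
    """Plan [start, end) frame windows for rolling generation.

    Hypothetical helper: each window after the first begins `overlap`
    frames before the previous window ends, so those shared frames can
    serve as conditioning for the next segment.
    """
    if total_frames <= 0:
        return []
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")

    stride = segment_len - overlap
    segments: List[Tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + segment_len, total_frames)
        segments.append((start, end))
        if end == total_frames:
            break
        start += stride
    return segments


def total_denoiser_calls(num_segments: int, nfe_per_segment: int = 8) -> int:
    """Assumption: every segment costs the same fixed number of evaluations."""
    return num_segments * nfe_per_segment


if __name__ == "__main__":
    plan = plan_rolling_segments(total_frames=200, segment_len=81, overlap=16)
    print(plan)                             # [(0, 81), (65, 146), (130, 200)]
    print(total_denoiser_calls(len(plan)))  # 24
```

Under these assumed numbers, a 200-frame clip needs three segments and 24 denoiser evaluations in total, which illustrates why cutting the per-segment step count matters more as videos get longer.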
ALampInMistyValley
· 57m ago
Who came up with the name LongCat? There must be cat enthusiasts inside Meituan.
GateUser-af0ea0c9
· 05-22 12:35
Commercial use requires separate discussion; it's the old routine of big companies open-sourcing their code.
HedgeHedgeBaby
· 05-22 09:42
Native support for single and multi-channel audio; this is needed for podcast clipping.
LendingRateAnxiety
· 05-22 09:22
What specifically did the 10 experts evaluate? Is it detailed in the paper?
TheWaveOfRasterization
· 05-22 08:33
Kudos for the MIT License; it's academic-friendly.
GlassBottleFeather
· 05-22 08:32
Is DMD2 distillation now standard? It seems like everyone is using it.
ReboundAtTheStreetCornerAfter
· 05-22 08:21
What on earth is "animal style"? Talking cats?
GateUser-dd8dffab
· 05-22 08:17
Improving identity consistency is very important; previously, changing perspectives easily made it seem like a different person.
GateUser-c29c3db9
· 05-22 08:17
770 evaluators, 13,240 judgments. Is this assessment scale serious?
BridgeTroll
· 05-22 08:17
Anime-style generalization is an easter egg; the fan-creation (derivative works) community is going to be lively.