Meituan's LongCat team has open-sourced LongCat-Video-Avatar 1.5, releasing both the code and the model weights. The framework switches its audio encoder to Whisper-large-v3 to improve multilingual lip-sync, uses multi-segment rolling inference for long videos, and applies DMD2-based few-step distillation to cut inference to 8 steps, balancing speed and fidelity. It was evaluated on 508 image-audio source pairs, with 13,240 judgments from 770 crowdsourced raters plus scoring by 10 experts. The release focuses on temporal stability, identity consistency, and natural lip movements; it can generalize to anime and animal styles and natively supports mono and multi-channel audio. The weights are released under the MIT License, but the demo content shown on the project page is for academic use only, and commercial deployment requires separate verification.

MeNews

2026-05-22 08:04:01

ME AI News: According to Data Observation Beating monitoring, Meituan's LongCat team has open-sourced LongCat-Video-Avatar 1.5, an audio-driven portrait video generation framework, releasing both the code and the model weights. The upgrade aims to provide stronger identity consistency in long videos and broader style generalization.

The framework replaces the Wav2Vec2 audio encoder with Whisper-large-v3 to improve lip-sync and lip-shape dynamics; the acoustic representations from Whisper-large-v3 are said to significantly improve the stability of multilingual and cross-lingual lip generation. For temporal stability, the framework adopts multi-segment rolling inference in long-video generation to keep the character's identity coherent. On the inference side, DMD2-based few-step distillation compresses denoising to 8 steps (8 NFE, the number of function evaluations), balancing inference efficiency and image fidelity.

The model was evaluated on 508 image-audio source pairs. The crowdsourced evaluation collected 13,240 judgments from 770 raters, and 10 experts scored the results on physical plausibility, coordination, temporal stability, and identity consistency. The official demonstration compares the framework with HeyGen, Kling Avatar 2.0, and OmniHuman-1.5, with a focus on temporal stability, identity consistency, and natural lip movements.

Beyond realistic portraits, the framework can generalize to anime and animal styles, and it natively supports mono and multi-channel audio input.

The model weights are released under the MIT license. However, the project's ethics statement notes that the generated content displayed on the page is for academic use only and may not be used commercially. Actual commercial deployment still requires separate verification of the weights, code, materials, and content boundaries. (Source: BlockBeats)
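The report does not describe how multi-segment rolling inference is implemented. As a rough illustration of the general idea only, the sketch below plans overlapping frame windows for a long video and counts denoising evaluations at 8 steps per segment. The function names, the overlap-based conditioning, and the frame counts are all hypothetical and are not taken from the LongCat-Video-Avatar code.

```python
from typing import List, Tuple


def plan_rolling_segments(
    total_frames: int, segment_len: int, overlap: int
) -> List[Tuple[int, int]]:
    """Split a long video into overlapping [start, end) frame windows.

    Assumption (not from the source): each segment after the first reuses
    the last `overlap` frames of the previous segment as conditioning, so
    that identity and motion carry over between segments.
    """
    if total_frames <= 0:
        return []
    if segment_len <= 0 or not (0 <= overlap < segment_len):
        raise ValueError("need segment_len > 0 and 0 <= overlap < segment_len")

    stride = segment_len - overlap  # new frames produced per segment
    segments: List[Tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + segment_len, total_frames)
        segments.append((start, end))
        if end == total_frames:
            break
        start += stride
    return segments


def total_nfe(num_segments: int, steps_per_segment: int = 8) -> int:
    """Total denoiser evaluations if every segment runs the same step count.

    The default of 8 matches the 8-step (8 NFE) figure in the report; a
    single evaluation per step is an assumption.
    """
    return num_segments * steps_per_segment


if __name__ == "__main__":
    windows = plan_rolling_segments(total_frames=10, segment_len=4, overlap=1)
    print(windows)                  # overlapping windows covering all frames
    print(total_nfe(len(windows)))  # evaluations across all segments
```

Under these assumptions, the cost of a long video grows linearly with the number of segments, which is why reducing the per-segment step count matters for long-form generation.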

ALampInMistyValley

· 57m ago

Who came up with the name LongCat? There must be cat enthusiasts inside Meituan.

GateUser-af0ea0c9

· 05-22 12:35

Commercial use requires separate discussion; it's the old routine of big companies open-sourcing their code.

HedgeHedgeBaby

· 05-22 09:42

Native support for single and multi-channel audio; this is needed for podcast clipping.

LendingRateAnxiety

· 05-22 09:22

What specifically did the 10 experts evaluate? Is it detailed in the paper?

TheWaveOfRasterization

· 05-22 08:33

Kudos for the MIT License; it's academic-friendly.

GlassBottleFeather

· 05-22 08:32

Is DMD2 distillation now standard? It seems like everyone is using it.

ReboundAtTheStreetCornerAfter

· 05-22 08:21

What the heck is "animal style"? Talking cats?

GateUser-dd8dffab

· 05-22 08:17

Improving identity consistency is very important; previously, a change of camera angle could easily make the subject look like a different person.

GateUser-c29c3db9

· 05-22 08:17

770 evaluators, 13,240 judgments. Is this assessment scale serious?

BridgeTroll

· 05-22 08:17

Anime-style generalization is an easter egg; the fan-creation community is going to get lively.

Meituan open-sources LongCat-Video-Avatar 1.5 digital human framework inference reduced to 8 steps
