We present AV-Flow, a novel method for joint audio-visual generation of 4D talking avatars given only text input (e.g., obtained from an LLM). Interconnected diffusion transformers ensure cross-modal communication, synthesizing synchronized speech, facial motion, and head motion based on the flow matching objective. AV-Flow further enables empathetic dyadic interactions by animating an always-on avatar that actively listens and reacts to the audio-visual input of a user.
We introduce AV-Flow, an audio-visual generative model that animates photo-realistic 4D talking avatars given only text input. In contrast to prior work that assumes an existing speech signal, we synthesize speech and vision jointly. We demonstrate human-like speech synthesis, synchronized lip motion, lively facial expressions, and head pose, all generated from just text characters. The core premise of our approach lies in the architecture of our two parallel diffusion transformers. Intermediate highway connections ensure communication between the audio and visual modalities, and thus synchronized speech intonation and facial dynamics (e.g., eyebrow motion). Our model is trained with flow matching, leading to expressive results and fast inference. In the case of dyadic conversations, AV-Flow produces an always-on avatar that actively listens and reacts to the audio-visual input of a user. Through extensive experiments, we show that our method outperforms prior work, synthesizing natural-looking 4D talking avatars.
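To make the described architecture concrete, below is a minimal, hypothetical PyTorch sketch of the idea: two parallel transformer streams (one per modality) that exchange information through intermediate "highway" connections and are trained with a conditional flow matching objective. All module names (`FusionBlock`, `AVFlowSketch`), feature dimensions, the additive fusion rule, and the straight-path flow matching target are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class FusionBlock(nn.Module):
    """One transformer layer per modality plus a cross-modal 'highway' exchange.
    The additive linear fusion here is an assumption for illustration."""
    def __init__(self, dim, n_heads=8):
        super().__init__()
        self.audio_block = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.visual_block = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.a2v = nn.Linear(dim, dim)  # audio -> visual highway
        self.v2a = nn.Linear(dim, dim)  # visual -> audio highway

    def forward(self, a, v):
        a, v = self.audio_block(a), self.visual_block(v)
        return a + self.v2a(v), v + self.a2v(a)


class AVFlowSketch(nn.Module):
    """Two parallel transformer streams predicting flow-matching velocities for
    audio features and face/head motion, conditioned on frame-aligned text features."""
    def __init__(self, dim=512, depth=6, audio_dim=80, motion_dim=128, text_dim=512):
        super().__init__()
        self.audio_in = nn.Linear(audio_dim + text_dim + 1, dim)   # +1 for the time step t
        self.visual_in = nn.Linear(motion_dim + text_dim + 1, dim)
        self.blocks = nn.ModuleList(FusionBlock(dim) for _ in range(depth))
        self.audio_out = nn.Linear(dim, audio_dim)
        self.visual_out = nn.Linear(dim, motion_dim)

    def forward(self, x_audio, x_motion, text_feat, t):
        t = t[:, None, None].expand(-1, x_audio.shape[1], 1)
        a = self.audio_in(torch.cat([x_audio, text_feat, t], dim=-1))
        v = self.visual_in(torch.cat([x_motion, text_feat, t], dim=-1))
        for blk in self.blocks:
            a, v = blk(a, v)
        return self.audio_out(a), self.visual_out(v)


def flow_matching_loss(model, audio, motion, text_feat):
    """Conditional flow matching: regress the velocity (x1 - x0) along the straight
    path x_t = (1 - t) * x0 + t * x1, jointly for both modalities."""
    t = torch.rand(audio.shape[0], device=audio.device)
    noise_a, noise_m = torch.randn_like(audio), torch.randn_like(motion)
    tt = t[:, None, None]
    xt_a = (1 - tt) * noise_a + tt * audio
    xt_m = (1 - tt) * noise_m + tt * motion
    pred_a, pred_m = model(xt_a, xt_m, text_feat, t)
    return (((pred_a - (audio - noise_a)) ** 2).mean()
            + ((pred_m - (motion - noise_m)) ** 2).mean())
```

As a usage example, with frame-aligned features of length `T`, `flow_matching_loss(AVFlowSketch(), audio, motion, text_feat)` can be minimized with a standard optimizer, where `audio` is `(B, T, 80)`, `motion` is `(B, T, 128)`, and `text_feat` is `(B, T, 512)`; at inference, the predicted velocities would be integrated from noise with an ODE solver.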
Overview of AV-Flow. Given any input text, our method synthesizes expressive audio-visual 4D talking avatars, jointly generating head and facial dynamics and the corresponding speech signal. Two parallel diffusion transformers with intermediate highway connections ensure communication between the audio and visual modalities. AV-Flow can additionally be conditioned on the audio-visual input of a user in order to synthesize conversational avatars in dyadic interactions.
If you find our work useful, please consider citing our paper: