From Natural Alignment to Conditional Controllability in Multimodal Dialogue

Anonymous

Overview

MM-Dia Teaser

Left: Example of a movie dialogue clip with sentence- and dialogue-level annotations in the MM-Dia and MM-Dia-Bench datasets, highlighting multimodal dialogue interaction details.
Right: Three dialogue-related cross-modal generation tasks involving text, audio, and vision, with both explicit (Task 1) and implicit control (Task 2, Task 3).

Abstract

The recent advancement of Artificial Intelligence Generated Content (AIGC) has led to significant strides in modeling human interaction, particularly in multi-modal dialogue. While current methods generate impressively realistic dialogue in the speech and vision modalities, challenges remain in multi-modal conditional dialogue generation. This paper focuses on the natural alignment between speech, vision, and text in human interaction, aiming at expressive dialogue generation through multi-modal conditional control. Since existing datasets lack richness and diversity in dialogue expressiveness, we introduce a novel multi-modal dialogue annotation pipeline that mines meaningful dialogues from movies and TV series and annotates them at a fine-grained level across modalities. The resulting dataset, MM-Dia, provides over 360 hours and 54,700 dialogues, facilitating the Multimodal Dialogue Generation task through explicit control in style-controllable dialogue speech synthesis. The proposed benchmark, MM-Dia-Bench, contains 309 highly expressive dialogues with visible dual- or single-speaker scenes, and supports the evaluation of implicit cross-modal control through downstream multi-modal dialogue generation tasks that assess audio-visual style consistency across modalities. Our experiments demonstrate the effectiveness of our data in enhancing style controllability and reveal limitations in current frameworks' ability to replicate the expressiveness of human interaction, providing new insights and challenges for multi-modal conditional dialogue generation.

MM-Dia & MM-Dia-Bench

MM-Dia Pipeline

Data curation pipeline. Framework of the movie- and TV-sourced in-the-wild data curation pipeline for multi-modal dialogue extraction, with fine-grained interaction-level annotations in two versions: dialogue Affective Triplet and Description. MM-Dia is the first dataset to specifically center on dialogue expressiveness across multiple modalities.


Upper Left: Detailed statistics of MM-Dia and MM-Dia-Bench. While both corpora contain highly expressive dialogue clips, MM-Dia-Bench is further filtered to ensure the presence of visible dual speakers with strong emotional expression.
Lower Left: Word cloud of non-verbal annotations. MM-Dia contains diverse annotations of non-verbal sounds, owing to the rich range of scenarios in movies and TV series.
Right: Relationship × Interaction Sunburst Chart. The dialogue affective triplet annotations are further categorized into 8 relationship types and 12 interaction types, presenting a balanced distribution and comprehensive understanding of dialogue dynamics.

Task 1: Style-Controllable Dialogue Speech Synthesis


1. In-domain dialogue speech synthesis with Description as style prompt.

Style Prompt | Transcript | Synthesized Dialogue Speech (Higgs Audio v2) | Synthesized Dialogue Speech (Trained on MM-Dia)
The Mother Abbess delivers her command with a calm finality, prompting Maria to gasp and then plead in a desperate, tumbling rush of words.
[Mother Abbess] Maria. It seems to be the will of God that you leave us.
[Maria] Leave?
[Mother Abbess] Only for a while, Maria.
[Maria] Oh, no, Mother, please don't do that. Don't send me away. This is where I belong. It’s my home, my family. It’s my life.
[Mother Abbess] But are you truly ready for it?
[Maria] Yes, I am.
Forrest's steady, simple questions are met with Jenny's quiet, sad deflections, creating a conversation full of heavy, painful pauses.
[Forrest] Will you marry me? I'd make a good husband, Jenny.
[Jenny] You would, Forrest.
[Forrest] But you won't marry me.
[Jenny] You don't want to marry me.
[Forrest] Why don't you love me, Jenny?
Friends Willy and Rusty erupt in joyous, loud shouts, feeding off each other's gleeful excitement.
[Grover Cleaver] Willard, clear my salad plate!
[Willard Filmore] Yeah, well, lick me, you old bag!
[Grover Cleaver] Your busboy's gone to Bermuda! We're going to Bali, so I can date her.
[Willard Filmore] Yeah!
[Grover Cleaver] Get some Bali booty. ♪ I'm gonna get some Bali booty ♪
Areida persistently questions a defensive Ella, their denials overlapping quickly until Ella finally gives in with an annoyed huff.
[Areida] Where were you?
[Ella] Oh. I met the prince.
[Areida] You met the prince?
[Ella] I don't wanna talk about it. Olive and Hattie were there.
[Areida] Hattie. Why do you always do what Hattie tells you to do?
[Ella] I don't.
[Areida] Yes, you do.
[Ella] I don't.
[Areida] Tell the truth.
[Ella] Oh, I do.
Monica speaks with frantic, escalating panic and accusation, while Ethan responds with calm amusement and playful defiance.
[Monica] What we did was wrong. oh, god. I just had sex with someone who wasn't alive during the bicentennial. i just had sex. Ethan, focus. how could you not tell me?
[Young Ethan] You never told me how old you were.
[Monica] Well, that's different. My lie didn't make one of us a felon in 48 states. What were you thinking?
[Young Ethan] I wasn't. I was too busy falling...
[Monica] Don't say it.
[Young Ethan] in love with you.
[Monica] Well, fall out of it. You know, you shouldn't even be here. It's a school night. Oh, god. Oh, god. I'm like those women that you see with shiny guys named chad. I'm joan collins.


2. Out-of-domain dialogue speech synthesis with Affective Triplet as style prompt, with Variable Control.

Style Prompt | Transcript | Synthesized Dialogue Speech (Higgs Audio v2) | Synthesized Dialogue Speech (Trained on MM-Dia)
Relationship: Lovers
Interaction Type: Frequent interruptions
Emotional State: Irritated impatience
[SPEAKER0] You don’t understand, you keep twisting my words—just let me explain.
[SPEAKER1] Oh, here we go again, another excuse. I’ve heard it all before.
[SPEAKER0] Please, listen, if you’d just give me a second, I can clear this up.
[SPEAKER1] No, you always say that, and it never changes anything.
Relationship: Employer-employee
Interaction Type: Frequent interruptions
Emotional State: Irritated impatience
[SPEAKER0] This report is full of mistakes. Do you even review your work?
[SPEAKER1] Please, listen, if you’d just give me a second, I can clear this up.
[SPEAKER0] I don’t need excuses, I need results.
[SPEAKER1] It’s not an excuse, it’s context you don’t have yet.
Relationship: Police-Criminal
Interaction Type: Frequent interruptions
Emotional State: Irritated impatience
[SPEAKER0] We’ve got your fingerprints at the scene—stop wasting my time.
[SPEAKER1] Please, listen, if you’d just give me a second, I can clear this up.
[SPEAKER0] Every suspect says that. The evidence is already against you.
[SPEAKER1] You don’t have the whole story, and that’s what I need to tell you.
Relationship: Friends
Interaction Type: Questioning
Emotional State: Irritated impatience
[SPEAKER0] How could you make such a mistake! You’re always an expert on it.
[SPEAKER1] Because things went wrong faster than I could fix them, that’s why.
[SPEAKER0] So what, you just forgot everything you brag about knowing?
[SPEAKER1] No, I just didn’t expect you to jump on me instead of helping.
Relationship: Friends
Interaction Type: Sarcasm
Emotional State: Irritated impatience
[SPEAKER0] You never just let things go, do you?
[SPEAKER1] You’re always an expert on it, aren’t you?
[SPEAKER0] I’m serious—can’t you stop acting like you know everything?
[SPEAKER1] Maybe if you stopped nagging, I wouldn’t have to!
Relationship: Friends
Interaction Type: Comforting
Emotional State: Irritated impatience
[SPEAKER0] I can’t keep doing this anymore… it’s exhausting.
[SPEAKER1] Don’t give up—come on, you’re always an expert on it.
[SPEAKER0] That’s not helping, you don’t understand how draining this is.
[SPEAKER1] I do understand, but if you just push through a little longer, you’ll make it.
Relationship: Friends
Interaction Type: Chat
Emotional State: Excited and then cried with joy
[SPEAKER0] Guess what, the tickets finally came through!
[SPEAKER1] I’ve been waiting for this moment forever.
[SPEAKER0] I can tell, you’re practically glowing.
[SPEAKER1] It’s everything I’ve dreamed about...
Relationship: Friends
Interaction Type: Chat
Emotional State: Calm, then turned into sadness
[SPEAKER0] Guess what, the tickets finally came through!
[SPEAKER1] I’ve been waiting for this moment forever.
[SPEAKER0] I can tell, you’re practically glowing.
[SPEAKER1] It’s everything I’ve dreamed about...
Relationship: Friends
Interaction Type: Chat
Emotional State: Growing frustration to anger
[SPEAKER0] Why are you making such a big deal about it?
[SPEAKER1] You don’t get it—I’ve been waiting for this moment forever.
[SPEAKER0] Okay, but you don’t have to snap at me.
[SPEAKER1] I wouldn’t if you took me seriously instead of brushing it off.
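The variable-control setup in the samples above (two fields of the Affective Triplet held fixed while the third is varied) can be sketched as follows. This is a minimal illustration and not the authors' code: `AffectiveTriplet`, `render_prompt`, and `vary` are hypothetical names, and the three-line prompt layout simply mirrors the style prompts shown on this page.

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class AffectiveTriplet:
    """Hypothetical container for the three style fields shown on this page."""
    relationship: str
    interaction_type: str
    emotional_state: str


def render_prompt(t: AffectiveTriplet) -> str:
    """Format a triplet in the three-line layout used by the samples above."""
    return (
        f"Relationship: {t.relationship}\n"
        f"Interaction Type: {t.interaction_type}\n"
        f"Emotional State: {t.emotional_state}"
    )


def vary(base: AffectiveTriplet, field: str, values: List[str]) -> List[AffectiveTriplet]:
    """Return copies of `base` that differ only in `field` (variable control)."""
    return [replace(base, **{field: v}) for v in values]


# Vary only the relationship; interaction type and emotional state stay fixed.
base = AffectiveTriplet("Lovers", "Frequent interruptions", "Irritated impatience")
variants = vary(base, "relationship", ["Lovers", "Employer-employee", "Police-Criminal"])
for v in variants:
    print(render_prompt(v), end="\n\n")
```

Keeping the triplet immutable and deriving each variant from one base makes it easy to confirm that any audible difference between two synthesized samples traces back to a single changed field.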

Task 2: Vision-Conditioned Dialogue Speech Synthesis


Dialogue speech is generated by: 1. HarmoniVox, 2. Cascaded GPT + Higgs-Audio-SFT, and 3. Cascaded Gemini + Higgs-Audio-SFT

Original Video | Conditioned Keyframe Sequence with Synthesized Spoken Dialogue

Task 3: Dialogue Video Generation


SI2V (Speaker-Image-to-Video) Task. Dialogue videos are generated by: 0. Original Video, 1. FLOAT, 2. Wan2-2 S2V, 3. MultiTalk, 4. Sonic


T2V (Text-to-Video) Task. Dialogue videos are generated by: 0. Original Video, 1. HunyuanVideo, 2. Wan2-2 S2V

Original Video

Prompt

Charlie Young (Male, Middle-aged) and Harper Moore (Female, Teenager).
Social banter with intimate tone.
Summary: Charlie lands a sharp, direct insult, which Harper playfully parries with a smile, kicking off a rapid and flirtatious exchange of witty jabs.
Emotions: flirtatious banter, witty repartee, playful teasing.
Keep identity consistent; lip-sync to audio.
Lines:
Charlie Young: "You're a know-it-all."
Harper Moore: "You are unbelievably bad at beer pong."
Charlie Young: "You're a sore winner. And you use too many exclamation points."

FLOAT

Wan2-2 S2V

Original Video

Prompt

Midge (Female, Youth adult) and Susie (Female, Middle-aged).
Emotion release with workplace tone.
Summary: A well-dressed Midge Maisel delivers a tense confession, prompting her gruff manager Susie Myerson to immediately erupt with a furious, finger-pointing command.
Emotions: Manager-Client Confrontation, Confession and Warning, Sudden Anger.
Keep identity consistent; lip-sync to audio.
Lines:
Midge: "I recorded you when you were sleeping last night."
Susie: "First of all, don't ever, ever touch the tape recorder. Hear me?"

FLOAT

Wan2-2 S2V

Original Video

Prompt

Robin Scherbatsky (Female, Youth adult) and Lily Aldrin (Female, Youth adult).
Social banter with friends tone.
Summary: The flustered news anchor Robin Scherbatsky vents with wide-eyed panic, while her friend, the quick-witted kindergarten teacher Lily Aldrin, calmly teases her with a deadpan comeback.
Emotions: Supportive friends, Venting and teasing, Agitation meets amusement.
Keep identity consistent; lip-sync to audio.
Lines:
Robin Scherbatsky: "That was so unprofessional! I said "nipple" on the news!"
Lily Aldrin: "At least it's better than "booger.""
Lily Aldrin: "Booger."

FLOAT

Wan2-2 S2V
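The prompts shown in the Task 3 examples share a fixed layout: a speaker line, a scene line, a summary, an emotion list, a constant instruction, and the dialogue lines. The sketch below shows one way such a prompt could be assembled from annotation fields. It is an assumption-laden illustration rather than the authors' pipeline: `build_video_prompt` and its parameter names are hypothetical, and the layout is inferred only from the examples on this page.

```python
from typing import List, Tuple


def build_video_prompt(
    speakers: List[Tuple[str, str, str]],  # (name, gender, age group)
    scene: str,                            # e.g. "Social banter with intimate tone"
    summary: str,
    emotions: List[str],
    lines: List[Tuple[str, str]],          # (speaker name, utterance)
) -> str:
    """Assemble a prompt in the layout inferred from the examples on this page."""
    who = " and ".join(f"{name} ({gender}, {age})" for name, gender, age in speakers)
    parts = [
        f"{who}.",
        f"{scene}.",
        f"Summary: {summary}",
        f"Emotions: {', '.join(emotions)}.",
        "Keep identity consistent; lip-sync to audio.",  # constant line in all examples
        "Lines:",
    ]
    parts += [f'{name}: "{text}"' for name, text in lines]
    return "\n".join(parts)


# Rebuild the first example prompt from its fields.
prompt = build_video_prompt(
    speakers=[("Charlie Young", "Male", "Middle-aged"), ("Harper Moore", "Female", "Teenager")],
    scene="Social banter with intimate tone",
    summary=(
        "Charlie lands a sharp, direct insult, which Harper playfully parries with a "
        "smile, kicking off a rapid and flirtatious exchange of witty jabs."
    ),
    emotions=["flirtatious banter", "witty repartee", "playful teasing"],
    lines=[
        ("Charlie Young", "You're a know-it-all."),
        ("Harper Moore", "You are unbelievably bad at beer pong."),
        ("Charlie Young", "You're a sore winner. And you use too many exclamation points."),
    ],
)
print(prompt)
```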