MotionCraft

Abstract

Whole-body multimodal motion generation, controlled by text, speech, or music, has numerous applications including video generation and character animation. However, employing a unified model to achieve various generation tasks with different condition modalities presents two main challenges: motion distribution drifts across different tasks (e.g., co-speech gestures and text-driven daily actions) and the complex optimization of mixed conditions with varying granularities (e.g., text and audio). Additionally, inconsistent motion formats across different tasks and datasets hinder effective training toward multimodal motion generation. In this paper, we propose MotionCraft, a unified diffusion transformer that crafts whole-body motion with plug-and-play multimodal control. Our framework employs a coarse-to-fine training strategy, starting with the first stage of text-to-motion semantic pre-training, followed by the second stage of multimodal low-level control adaptation to handle conditions of varying granularities. To effectively learn and transfer motion knowledge across different distributions, we design MC-Attn for parallel modeling of static and dynamic human topology graphs. To overcome the motion format inconsistency of existing benchmarks, we introduce MC-Bench, the first available multimodal whole-body motion generation benchmark based on the unified SMPL-X format. Extensive experiments show that MotionCraft achieves state-of-the-art performance on various standard motion generation tasks.
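The coarse-to-fine schedule described above can be illustrated with a minimal sketch of which parameter groups are trained in each stage. This is not the authors' implementation: the class `UnifiedMotionModel`, the helper `configure_stage`, and the group names are hypothetical, and freezing the text-pretrained backbone during the second stage is an assumption suggested by the "plug-and-play" wording, not something the abstract states.

```python
from dataclasses import dataclass, field


@dataclass
class ParamGroup:
    """A named set of parameters with a flag saying whether it is updated."""
    name: str
    trainable: bool = False


@dataclass
class UnifiedMotionModel:
    """Hypothetical stand-in: a text-conditioned backbone plus one
    control branch per additional condition modality."""
    backbone: ParamGroup = field(default_factory=lambda: ParamGroup("backbone"))
    control_branches: dict = field(default_factory=dict)

    def add_control(self, modality: str) -> None:
        # Attach a new branch without modifying the backbone itself.
        self.control_branches[modality] = ParamGroup(f"control/{modality}")

    def trainable_names(self) -> list:
        groups = [self.backbone, *self.control_branches.values()]
        return [g.name for g in groups if g.trainable]


def configure_stage(model: UnifiedMotionModel, stage: int) -> None:
    """Select which parameter groups are updated in each training stage."""
    if stage == 1:
        # Stage 1: text-to-motion semantic pre-training of the backbone.
        model.backbone.trainable = True
        for branch in model.control_branches.values():
            branch.trainable = False
    elif stage == 2:
        # Stage 2: low-level control adaptation.
        # ASSUMPTION: the backbone is frozen and only the branches are trained.
        model.backbone.trainable = False
        for branch in model.control_branches.values():
            branch.trainable = True
    else:
        raise ValueError(f"unknown stage: {stage}")


model = UnifiedMotionModel()
model.add_control("speech")
model.add_control("music")

configure_stage(model, 1)
stage1 = model.trainable_names()

configure_stage(model, 2)
stage2 = model.trainable_names()
```

Under these assumptions, `stage1` contains only the backbone and `stage2` contains only the speech and music branches, so a new modality could be added by attaching another branch without retraining the text-driven model.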

Comparing with Baselines in Text2Motion

Each video compares FineMoGen, MCM, and Ours on one of the following text prompts:

Text-aligned natural motions: the figure walks forward and looks like he trips on something.

Precise motion counting: a person does 2 jumping jacks.

High-quality motions with less foot sliding: a person does a squat and raises both arms over its head.

a person walks forward, arms by their side.

a person lifts up a box on the right side and puts it on the left side.

a person jogging in place.

a person uses their right arm to wave.

a person walks in a right bend direction.

a person sits in a chair then stands back up.

a person does a small jump.

a person kneels down on both knees.

the person is stretching arms out.

a person stretched right arm up and over to the left, then left arm up and over to the right.

a person raises both arms and punches with his right arm.

a step forward then a large step like walking over something then back to normal.

the figure plants four steps leading with it's left foot, with a fifth step not planted.

a person walks in place.

a person leaps forward then stands straight.

a person jobs up a few paces in a straight line and stops.

he opened his both arms and then sit on the bench and then start moving his right hand.

a person strides swiftly in a straight line.

he stomps his left feet.

a person makes several gestures with their hands, appear to scratch, stretch their arms and wave their arms around while twisting their torso.

a person is jumping up and down.

the person is scratching head.

We compare the generation results of MotionCraft with those of the baselines. Our model shows clear advantages in controllability, sequentiality, and motion rationality.

Comparing with Baselines in Speech2Gesture

Each video compares Ground Truth, Ours, EMAGE, and MCM on one of four speech clips. The prompt for each clip is:

A person is giving a speech, and the content is ...

We compare the generation results of MotionCraft with those of the baselines. Our model shows clear advantages in controllability, sequentiality, and motion rationality.

Comparing with Baselines in Music2Dance

Each video compares FineDance, MCM, and Ours on one of the following prompts:

A dancer is performing a Street dance in the Breaking style to the rhythm of the Just_Begun.

A dancer is performing a Street dance in the Hiphop style to the rhythm of the idoit4.

A dancer is performing a Mix dance in the Korean style to the rhythm of the killthislove.

A dancer is performing a Street dance in the Hiphop style to the rhythm of the sevenFloul.

A dancer is performing a Street dance in the Jazz style to the rhythm of the PROBLEMA.

A dancer is performing a Street dance in the Jazz style to the rhythm of the wildfire.

A dancer is performing a Street dance in the Breaking style to the rhythm of the Just_Begun.

A dancer is performing a Mix dance in the Choreography style to the rhythm of the shuixingji.

A dancer is performing a Mix dance in the Korean style to the rhythm of the pinkvenom.

A dancer is performing a Mix dance in the Korean style to the rhythm of the dreamscometrue.

We compare the generation results of MotionCraft with those of the baselines. Our model shows clear advantages in controllability, sequentiality, and motion rationality.

Citation

@article{MotionCraft,
  title={MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls},
  author={Bian, Yuxuan and Zeng, Ailing and Ju, Xuan and Liu, Xian and Zhang, Zhaoyang and Liu, Wei and Xu, Qiang},
  journal={arXiv preprint arXiv:2407.21136},
  year={2024}
}

The website template was adapted from HumanTomato Project.

MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls

Yuxuan Bian¹, Ailing Zeng²,#, Xuan Ju¹, Xian Liu¹, Zhaoyang Zhang¹, Wei Liu², and Qiang Xu¹,#

#Corresponding authors.

¹The Chinese University of Hong Kong, ²Tencent
