MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls

Yuxuan Bian1, Ailing Zeng#2, Xuan Ju1, Xian Liu1, Zhaoyang Zhang1, Wei Liu2, and Qiang Xu#1

#Corresponding Authors.
1The Chinese University of Hong Kong, 2Tencent

Abstract

Whole-body multimodal motion generation, controlled by text, speech, or music, has numerous applications including video generation and character animation. However, employing a unified model to accomplish various generation tasks with different condition modalities presents two main challenges: motion distribution drift across tasks (e.g., co-speech gestures versus text-driven daily actions) and the complex optimization of mixed conditions with varying granularities (e.g., text and audio). Additionally, inconsistent motion formats across tasks and datasets hinder effective training toward multimodal motion generation. In this paper, we propose MotionCraft, a unified diffusion transformer that crafts whole-body motion with plug-and-play multimodal control. Our framework employs a coarse-to-fine training strategy: a first stage of text-to-motion semantic pre-training, followed by a second stage of multimodal low-level control adaptation that handles conditions of varying granularities. To effectively learn and transfer motion knowledge across different distributions, we design MC-Attn for parallel modeling of static and dynamic human topology graphs. To overcome the motion format inconsistency of existing benchmarks, we introduce MC-Bench, the first available multimodal whole-body motion generation benchmark based on the unified SMPL-X format. Extensive experiments show that MotionCraft achieves state-of-the-art performance on various standard motion generation tasks.
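The abstract describes MC-Attn as modeling static and dynamic human topology graphs in parallel. Below is a minimal, hypothetical PyTorch sketch of that general idea, not the paper's actual MC-Attn implementation: the module name ParallelGraphAttention, the tensor shapes, the additive skeleton-adjacency bias, and the summation of the two branches are all assumptions made for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class ParallelGraphAttention(nn.Module):
    """Toy attention over J body joints with two parallel branches:
    a static branch biased by a fixed skeleton adjacency and a dynamic
    branch whose joint-to-joint affinities come from the current features.
    This is an illustrative sketch, not the MotionCraft implementation."""

    def __init__(self, dim: int, num_joints: int, skeleton_adj: torch.Tensor):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.out = nn.Linear(dim, dim)
        # Static topology: a fixed (e.g., SMPL-X kinematic-tree) adjacency,
        # turned into an additive attention bias (0 for edges, -inf otherwise).
        bias = torch.where(skeleton_adj > 0,
                           torch.zeros_like(skeleton_adj),
                           torch.full_like(skeleton_adj, float("-inf")))
        self.register_buffer("static_bias", bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_joints, dim) per-joint motion features.
        q, k, v = self.q(x), self.k(x), self.v(x)
        scale = q.shape[-1] ** -0.5
        logits = torch.einsum("bid,bjd->bij", q, k) * scale
        # Dynamic branch: fully data-dependent joint-to-joint attention.
        dyn = F.softmax(logits, dim=-1) @ v
        # Static branch: the same attention restricted to skeleton edges.
        sta = F.softmax(logits + self.static_bias, dim=-1) @ v
        return self.out(dyn + sta)

if __name__ == "__main__":
    J, D = 22, 64
    adj = torch.eye(J)  # placeholder adjacency (self-loops only), hypothetical
    attn = ParallelGraphAttention(D, J, adj)
    print(attn(torch.randn(2, J, D)).shape)  # torch.Size([2, 22, 64])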

Our Main Contribution

Multimodal Video Generation Application with Our Generated Motions

System Overview

Demo Video of Motion Generation Tasks

Comparing with Baselines in Text2Motion

Comparing with Baselines in Speech2Gesture

Comparing with Baselines in Music2Dance

Citation

@article{MotionCraft,
  title={MotionCraft: Crafting Whole-Body Motion with Plug-and-Play Multimodal Controls},
  author={Bian, Yuxuan and Zeng, Ailing and Ju, Xuan and Liu, Xian and Zhang, Zhaoyang and Liu, Wei and Xu, Qiang},
  journal={arXiv preprint arXiv:2407.21136},
  year={2024}
}

The website template was adapted from the HumanTomato project page.