CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos


Chengfeng Zhao¹, Jiazhi Shu², Yubo Zhao¹, Tianyu Huang³, Jiahao Lu¹,
Zekai Gu¹, Chengwei Ren¹, Zhiyang Dou⁴, Qing Shuai⁵, Yuan Liu¹

¹The Hong Kong University of Science and Technology    ²South China University of Technology
³The Chinese University of Hong Kong    ⁴Massachusetts Institute of Technology    ⁵Zhejiang University
Corresponding author

In this paper, we find that the generation of 3D human motions and 2D human videos is intrinsically coupled. 3D motions provide a structural prior that makes videos plausible and consistent, while pre-trained video models offer strong generalization capabilities for motions; this motivates coupling the two generation processes. Based on this observation, we present CoMoVi, a co-generative framework that couples two video diffusion models (VDMs) to generate 3D human motions and videos synchronously within a single diffusion denoising loop. To achieve this, we first propose an effective 2D human motion representation that can inherit the powerful prior of pre-trained VDMs. Then, we design a dual-branch diffusion model that couples the human motion and video generation processes through mutual feature interaction and 3D-2D cross-attention. Moreover, we curate the CoMoVi Dataset, a large-scale real-world human video dataset with text and motion annotations, covering diverse and challenging human motions. Extensive experiments demonstrate the effectiveness of our method in both 3D human motion and video generation tasks.
Motivation

3D human motion generation and video generation are strongly coupled. High-quality 3D motion can lead to high-fidelity generated videos and, conversely, the powerful prior of VDMs can enhance the generalization capabilities of 3D motion generation.

Pipeline

CoMoVi consists of two components: an effective 2D human motion representation that encodes 3D motion information in pixel space, and a dual-branch diffusion model, extended from Wan2.2-I2V-5B, that couples the denoising of the 2D motion sequence and the RGB video sequence. The model also includes 3D-2D cross-attention modules so that 3D human motion is generated concurrently.
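
To make the idea of coupled branches more concrete, here is a minimal NumPy sketch of two token streams that exchange information through cross-attention inside one shared loop. It is only an illustration of the general technique, not the CoMoVi architecture: the single-head attention, the residual update, the token counts, and the names `cross_attention`, `coupled_step`, `rgb_from_motion`, and `motion_from_rgb` are all assumptions introduced here, and random matrices stand in for learned weights.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: each query token attends to all context tokens."""
    q = queries @ w_q                                   # (n_q, d)
    k = context @ w_k                                   # (n_c, d)
    v = context @ w_v                                   # (n_c, d)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))   # (n_q, n_c)
    return weights @ v                                  # (n_q, d)


def coupled_step(rgb, motion, params, scale=0.1):
    """One toy iteration: both branches are updated from the same pre-update state."""
    rgb_msg = cross_attention(rgb, motion, *params["rgb_from_motion"])
    motion_msg = cross_attention(motion, rgb, *params["motion_from_rgb"])
    # Residual update; a real model would run its denoiser blocks here (assumption).
    return rgb + scale * rgb_msg, motion + scale * motion_msg


rng = np.random.default_rng(0)
dim = 8
# Hypothetical random projections standing in for learned weights.
params = {
    name: tuple(rng.normal(size=(dim, dim)) / np.sqrt(dim) for _ in range(3))
    for name in ("rgb_from_motion", "motion_from_rgb")
}
rgb_tokens = rng.normal(size=(6, dim))      # stand-in for RGB video latents
motion_tokens = rng.normal(size=(4, dim))   # stand-in for 2D motion latents

for _ in range(3):                          # one shared loop drives both branches
    rgb_tokens, motion_tokens = coupled_step(rgb_tokens, motion_tokens, params)

print(rgb_tokens.shape, motion_tokens.shape)
```

Each branch keeps its own token count while reading from the other, which is the property that lets two differently shaped sequences be refined together in a single loop.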

Our Results
Qualitative Comparisons
Video Generation Comparison
Motion Generation Comparison
A woman in black and white patterned yoga attire performs the downward-facing dog yoga pose.
A woman in beige clothing does a seated forward bend pose, twisting her torso and wrapping her left arm around her back.
A shirtless man with black pants performs a seated twisting yoga pose, clasping his hands behind his back while on a yoga mat by a lake.
A woman in a red shirt and blue leggings transitions from a low lunge with a twist to Warrior II pose.
A woman with short dark hair, wearing a white tank top and light blue leggings, performs a seated balancing yoga pose on a mat with her hands on yoga blocks, while talking and gesturing with her hands.
👈 CogVideoX1.5-I2V-5B
Ours
👉 Wan2.2-I2V-5B
CoMoVi Dataset
Daily Motion
Popular Sports
Strength Training
Calisthenics
Mind-Body & Flexibility
Stationary Cardio
Target & Precision Sports
Indoor & Stationary Equipment
Wheeled Sports
Winter Sports
Water Sports
Air Sports
Dancing

🎇 The CoMoVi dataset features:

🔎 What you can do with the CoMoVi dataset:

Citation

@InProceedings{zhao2026comovi,
  title   = {CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos},
  author  = {Zhao, Chengfeng and Shu, Jiazhi and Zhao, Yubo and Huang, Tianyu and Lu, Jiahao and Gu, Zekai and Ren, Chengwei and Dou, Zhiyang and Shuai, Qing and Liu, Yuan},
  journal = {arXiv preprint arXiv:2601.10632},
  year    = {2026}
}
