1 The Hong Kong University of Science and Technology
2 South China University of Technology
3 The Chinese University of Hong Kong
4 Massachusetts Institute of Technology
5 Zhejiang University
Corresponding author
There is a strong coupling between 3D human motion and video generation: high-quality 3D motion can drive high-fidelity video generation, and conversely, the powerful prior of video diffusion models (VDMs) can enhance the generalization ability of 3D motion generation.
CoMoVi consists of (1) an effective 2D human motion representation that encodes 3D motion information in pixel space, and (2) a dual-branch diffusion model, extended from Wan2.2-I2V-5B, that couples the denoising processes of the 2D motion and RGB video sequences through 3D-2D cross-attention modules, enabling the concurrent generation of 3D human motion.
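The coupling idea can be illustrated with a minimal PyTorch sketch of one dual-branch denoising block in which video tokens and motion tokens exchange information through bidirectional cross-attention. All names, shapes, and module choices here (`DualBranchBlock`, `CrossAttention`, single-head attention, the toy token counts) are illustrative assumptions, not the actual CoMoVi implementation, which extends the Wan2.2-I2V-5B backbone.

```python
import torch
import torch.nn as nn

class CrossAttention(nn.Module):
    """Single-head cross-attention: queries come from one stream,
    keys/values from the other (hypothetical simplification)."""
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x, context):
        q, k, v = self.to_q(x), self.to_k(context), self.to_v(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v

class DualBranchBlock(nn.Module):
    """One coupled denoising block: a video branch and a motion branch each
    run self-attention, then exchange features via 3D-2D cross-attention."""
    def __init__(self, dim):
        super().__init__()
        self.video_self = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.motion_self = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.video_from_motion = CrossAttention(dim)  # video queries attend to motion tokens
        self.motion_from_video = CrossAttention(dim)  # motion queries attend to video tokens

    def forward(self, video_tokens, motion_tokens):
        video_tokens = self.video_self(video_tokens)
        motion_tokens = self.motion_self(motion_tokens)
        # bidirectional coupling between the two denoising streams
        video_tokens = video_tokens + self.video_from_motion(video_tokens, motion_tokens)
        motion_tokens = motion_tokens + self.motion_from_video(motion_tokens, video_tokens)
        return video_tokens, motion_tokens

# Toy example: 16 frames of flattened video latents and 16 motion tokens.
video = torch.randn(1, 16 * 64, 512)
motion = torch.randn(1, 16, 512)
block = DualBranchBlock(512)
video, motion = block(video, motion)
```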
🎇 The CoMoVi dataset features:
🔎 What you can do with the CoMoVi dataset:
@InProceedings{zhao2026comovi,
  title   = {CoMoVi: Co-Generation of 3D Human Motions and Realistic Videos},
  author  = {Zhao, Chengfeng and Shu, Jiazhi and Zhao, Yubo and Huang, Tianyu and Lu, Jiahao and Gu, Zekai and Ren, Chengwei and Dou, Zhiyang and Shuai, Qing and Liu, Yuan},
  journal = {arXiv preprint arXiv:2601.10632},
  year    = {2026}
}