Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control

1The Hong Kong University of Science and Technology, 2Zhejiang University, 3The University of Hong Kong, 4Nanyang Technological University, 5Wuhan University, 6Texas A&M University

We present Diffusion as Shader (DaS), a versatile video generation control model that supports all of the following tasks.

Abstract

Diffusion models have demonstrated impressive performance in generating high-quality videos from text prompts or images. However, precise control over the video generation process—such as camera manipulation or content editing—remains a significant challenge. Existing methods for controlled video generation are typically limited to a single control type, lacking the flexibility to handle diverse control demands. In this paper, we introduce Diffusion as Shader (DaS), a novel approach that supports multiple video control tasks within a unified architecture. Our key insight is that achieving versatile video control necessitates leveraging 3D control signals, as videos are fundamentally 2D renderings of dynamic 3D content. Unlike prior methods limited to 2D control signals, DaS leverages 3D tracking videos as control inputs, making the video diffusion process inherently 3D-aware. This innovation allows DaS to achieve a wide range of video controls by simply manipulating the 3D tracking videos. A further advantage of using 3D tracking videos is their ability to effectively link frames, significantly enhancing the temporal consistency of the generated videos. With just 3 days of fine-tuning on 8 H800 GPUs using fewer than 10k videos, DaS demonstrates strong control capabilities across diverse tasks, including mesh-to-video generation, camera control, motion transfer, and object manipulation.

Object Manipulation

DaS can generate videos that manipulate a specific object. Given an image, we estimate its depth map with Depth Pro or MoGe and segment the target object with SAM. We can then manipulate the object's point cloud to construct a 3D tracking video that drives the object-manipulation video generation.
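As a rough illustration of this construction, the sketch below builds such a tracking video with plain NumPy. It assumes the depth map, object mask, and camera intrinsics K are already available (in the pipeline above they would come from Depth Pro/MoGe and SAM), and it simply reuses the input image's pixel colors for the tracked points; the exact coloring scheme of DaS's tracking videos may differ.

import numpy as np

def unproject(depth, K):
    # Lift an (H, W) depth map to per-pixel camera-space 3D points.
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

def splat(points, colors, K, H, W):
    # Project colored points to an image; far points are drawn first so
    # nearer points overwrite them (a crude painter's-algorithm splat).
    z = np.clip(points[:, 2:3], 1e-6, None)
    uv = (points / z) @ K.T
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    order = np.argsort(-points[:, 2])
    keep = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    sel = order[keep[order]]
    frame = np.zeros((H, W, 3), np.uint8)
    frame[v[sel], u[sel]] = colors[sel]
    return frame

def object_manipulation_video(image, depth, mask, K, translation, n_frames=49):
    # Translate the masked object's points linearly toward `translation`
    # (camera space) while the background points stay fixed.
    H, W, _ = image.shape
    points = unproject(depth, K)
    colors = image.reshape(-1, 3)
    obj = mask.reshape(-1)
    frames = []
    for t in np.linspace(0.0, 1.0, n_frames):
        moved = points.copy()
        moved[obj] += t * np.asarray(translation)
        frames.append(splat(moved, colors, K, H, W))
    return np.stack(frames)  # (n_frames, H, W, 3) control video for DaS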

Animating Meshes to Videos

Rendering styles: "A muscular man" and "Man in a suit".

DaS enables the creation of visually appealing, high-quality videos from simple animated meshes. While many Computer Graphics (CG) software tools provide basic 3D models and motion templates to generate animated meshes, these outputs are often simplistic and lack the detailed appearance and geometry needed for high-quality animations. Starting with these simple animated meshes, we generate an initial visually appealing frame using a depth-to-image FLUX model. We then produce 3D tracking videos from the animated meshes, which, when combined with the generated first frame, guide DaS to transform the basic meshes into visually rich and appealing videos.
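To make the control signal concrete, here is a minimal sketch of turning an animated mesh into a 3D tracking video. It assumes the animation has been exported as per-frame vertex arrays in camera space and that the camera intrinsics K are known; each sampled vertex keeps one fixed random color for the whole clip so the model can follow it (DaS's actual point-coloring scheme may differ).

import numpy as np

def mesh_tracking_video(verts_per_frame, K, H, W, n_points=4096, seed=0):
    # `verts_per_frame`: list of (V, 3) camera-space vertex arrays, one
    # per output frame, with consistent vertex ordering across frames.
    rng = np.random.default_rng(seed)
    n_points = min(n_points, len(verts_per_frame[0]))
    idx = rng.choice(len(verts_per_frame[0]), size=n_points, replace=False)
    # One fixed color per sampled vertex for the whole clip.
    colors = rng.integers(0, 256, size=(n_points, 3), dtype=np.uint8)
    frames = []
    for verts in verts_per_frame:
        p = verts[idx]
        z = np.clip(p[:, 2:3], 1e-6, None)
        uv = (p / z) @ K.T
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        frame = np.zeros((H, W, 3), np.uint8)
        order = np.argsort(-p[:, 2])  # draw far-to-near for occlusion
        keep = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        sel = order[keep[order]]
        frame[v[sel], u[sel]] = colors[sel]
        frames.append(frame)
    return np.stack(frames)  # combined with the FLUX first frame as DaS input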

Camera Control

DaS significantly enhances 3D awareness by incorporating 3D tracking videos for precise camera control. To generate videos with a specific camera trajectory, we first estimate the depth map of the initial frame using Depth Pro and convert it into colored 3D points. These points are then re-projected under each camera pose along the given trajectory, constructing a 3D tracking video that enables DaS to control camera movements with high 3D accuracy.
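A minimal sketch of this re-projection, assuming the trajectory is given as hypothetical 4x4 world-to-camera matrices (the first pose being the identity, i.e. the input view) and that the world frame coincides with the first camera's frame:

import numpy as np

def camera_tracking_video(image, depth, K, poses):
    # Re-project the first frame's colored points under each pose of the
    # target trajectory to build the camera-control tracking video.
    H, W, _ = image.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    pts = np.stack([x, y, depth], axis=-1).reshape(-1, 3)
    colors = image.reshape(-1, 3)
    frames = []
    for T in poses:  # 4x4 world-to-camera matrix
        cam = pts @ T[:3, :3].T + T[:3, 3]
        z = np.clip(cam[:, 2:3], 1e-6, None)
        uv = (cam / z) @ K.T
        ui = np.round(uv[:, 0]).astype(int)
        vi = np.round(uv[:, 1]).astype(int)
        frame = np.zeros_like(image)
        order = np.argsort(-cam[:, 2])  # draw far-to-near for occlusion
        keep = (ui >= 0) & (ui < W) & (vi >= 0) & (vi < H) & (cam[:, 2] > 0)
        sel = order[keep[order]]
        frame[vi[sel], ui[sel]] = colors[sel]
        frames.append(frame)
    return np.stack(frames)

# e.g. a "Moving Right" trajectory: the camera translates along +x, so
# world points shift along -x in camera coordinates.
poses = [np.eye(4) for _ in range(49)]
for i, T in enumerate(poses):
    T[0, 3] = -0.02 * i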


Spiral

Spiral camera trajectories, shown on two example scenes.

Motion Transfer

Motion from a source video transferred to new appearances, with prompts such as "A yellow fox runs on the grass" and "A herd of bird-deer in a towering, wooded forest."

DaS also facilitates creating a new video by transferring motion from an existing source video. First, we estimate the depth map of the source video’s first frame and apply the depth-to-image FLUX model to repaint the frame into a target appearance guided by text prompts. Then, using SpatialTracker, we generate a 3D tracking video from the source video to serve as control signals. Finally, the DaS model generates the target video by combining the edited first frame with the 3D tracking video.
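End to end, the pipeline might look like the sketch below, where estimate_depth, flux_repaint, run_spatial_tracker, and das_generate are hypothetical wrappers around Depth Pro, the depth-to-image FLUX model, SpatialTracker, and DaS (none of these names are real APIs), and the track rendering reuses the point-splatting recipe of the earlier sketches.

import numpy as np

def render_tracks(tracks, K, H, W, seed=0):
    # Render (T, N, 3) camera-space point tracks; each point keeps one
    # fixed color across all frames (no z-ordering, for brevity).
    rng = np.random.default_rng(seed)
    colors = rng.integers(0, 256, size=(tracks.shape[1], 3), dtype=np.uint8)
    frames = []
    for pts in tracks:
        z = np.clip(pts[:, 2:3], 1e-6, None)
        uv = (pts / z) @ K.T
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        frame = np.zeros((H, W, 3), np.uint8)
        keep = (u >= 0) & (u < W) & (v >= 0) & (v < H)
        frame[v[keep], u[keep]] = colors[keep]
        frames.append(frame)
    return np.stack(frames)

def motion_transfer(source_video, prompt, K, estimate_depth, flux_repaint,
                    run_spatial_tracker, das_generate):
    # Three-stage motion transfer: repaint the first frame, lift the
    # source motion to 3D tracks, then condition DaS on both.
    H, W, _ = source_video[0].shape
    depth0 = estimate_depth(source_video[0])      # e.g. Depth Pro
    first_frame = flux_repaint(depth0, prompt)    # depth-to-image FLUX
    tracks = run_spatial_tracker(source_video)    # (T, N, 3) 3D tracks
    tracking_video = render_tracks(tracks, K, H, W)
    return das_generate(first_frame, tracking_video, prompt)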

BibTeX

@article{gu2025das,
  title={Diffusion as Shader: 3D-aware Video Diffusion for Versatile Video Generation Control},
  author={Zekai Gu and Rui Yan and Jiahao Lu and Peng Li and Zhiyang Dou and Chenyang Si and Zhen Dong and Qifeng Liu and Cheng Lin and Ziwei Liu and Wenping Wang and Yuan Liu},
  year={2025},
  journal={arXiv preprint arXiv:2501.03847}
}