Align3R: Aligned Monocular Depth Estimation for Dynamic Videos

CVPR 2025

Jiahao Lu¹*, Tianyu Huang²*, Peng Li¹, Zhiyang Dou³, Cheng Lin³,
Zhiming Cui⁴, Zhen Dong⁵, Sai-Kit Yeung¹, Wenping Wang⁶, Yuan Liu¹,⁷†

¹HKUST ²CUHK ³HKU ⁴ShanghaiTech ⁵WHU ⁶TAMU ⁷NTU

Align3R estimates temporally consistent video depth, dynamic point clouds, and camera poses from monocular videos.

Given two frames of a video, we apply a ViT-based encoder and decoder to predict pairwise point maps from them. In this process, we apply an external monocular depth estimator to estimate depth maps for the two images, process the estimated depths with a new ViT-based encoder, and inject the features extracted by this new encoder into the original DUSt3R decoder through zero-convolution layers. During inference, we apply global alignment to obtain depth maps, camera poses, and point clouds that are consistent across all frames.
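To make the injection concrete, below is a minimal PyTorch sketch of the zero-convolution-style feature injection described above. The module names (ZeroLinear, DepthConditionedDecoderBlock) and the token dimensions are illustrative assumptions, not the released Align3R implementation; the key property is that the zero-initialized projection makes the depth branch a no-op at the start of fine-tuning, so training begins from unmodified DUSt3R behavior.

```python
import torch
import torch.nn as nn

class ZeroLinear(nn.Module):
    """Zero-initialized projection: the injected branch starts as a no-op,
    so fine-tuning begins from the original DUSt3R behavior."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        nn.init.zeros_(self.proj.weight)
        nn.init.zeros_(self.proj.bias)

    def forward(self, x):
        return self.proj(x)

class DepthConditionedDecoderBlock(nn.Module):
    """Wraps one decoder block (hypothetical stand-in for a DUSt3R decoder
    block) and adds zero-projected depth tokens to its input."""
    def __init__(self, decoder_block, dim):
        super().__init__()
        self.block = decoder_block
        self.zero = ZeroLinear(dim)

    def forward(self, tokens, depth_tokens):
        return self.block(tokens + self.zero(depth_tokens))

# Toy usage: a stand-in transformer block over 256 patch tokens of width 768.
dim = 768
block = nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True)
layer = DepthConditionedDecoderBlock(block, dim)
tokens = torch.randn(1, 256, dim)        # image tokens entering the decoder
depth_tokens = torch.randn(1, 256, dim)  # tokens from the new depth-map encoder
out = layer(tokens, depth_tokens)        # equals block(tokens) at initialization
print(out.shape)  # torch.Size([1, 256, 768])
```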

Abstract

Recent developments in monocular depth estimation enable high-quality depth estimation of single-view images but fail to estimate consistent video depth across different frames. Recent works address this problem by applying a video diffusion model to generate video depth conditioned on the input video, which is expensive to train and can only produce scale-invariant depth values without camera poses. In this paper, we propose a novel video-depth estimation method called Align3R that estimates temporally consistent depth maps for a dynamic video. Our key idea is to utilize the recent DUSt3R model to align estimated monocular depth maps across different timesteps. First, we fine-tune the DUSt3R model on dynamic scenes with additional estimated monocular depth maps as inputs. Then, we apply an optimization to jointly reconstruct depth maps and camera poses. Extensive experiments demonstrate that Align3R estimates consistent video depth and camera poses for a monocular video, outperforming baseline methods.
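The optimization step works in the spirit of DUSt3R's global alignment: per-frame point maps in a shared world frame, per-frame camera poses, and per-pair scale factors are jointly optimized so that every pairwise prediction, once transformed into world space, agrees with the shared reconstruction. The sketch below is a heavily simplified, hypothetical version of that idea (random stand-in predictions, an axis-angle pose parameterization, an L1 objective), not the paper's exact formulation.

```python
import torch

def axis_angle_to_matrix(r):
    """Rodrigues' formula: axis-angle vector (3,) -> rotation matrix (3, 3)."""
    theta = r.norm().clamp_min(1e-8)
    k = r / theta
    K = torch.zeros(3, 3, dtype=r.dtype)
    K[0, 1], K[0, 2] = -k[2], k[1]
    K[1, 0], K[1, 2] = k[2], -k[0]
    K[2, 0], K[2, 1] = k[0], -k[1]
    eye = torch.eye(3, dtype=r.dtype)
    return eye + torch.sin(theta) * K + (1 - torch.cos(theta)) * (K @ K)

# Pairwise predictions for edges (i, j): points of frame i in camera i's frame.
edges = [(0, 1), (1, 2), (0, 2)]
pred = {e: torch.randn(100, 3) for e in edges}  # stand-ins for network outputs

# Variables: shared world points per frame, per-frame pose, per-edge scale.
world = {i: torch.randn(100, 3, requires_grad=True) for i in range(3)}
rot = {i: (0.01 * torch.randn(3)).requires_grad_() for i in range(3)}
trans = {i: torch.zeros(3, requires_grad=True) for i in range(3)}
scale = {e: torch.ones(1, requires_grad=True) for e in edges}

params = (list(world.values()) + list(rot.values())
          + list(trans.values()) + list(scale.values()))
opt = torch.optim.Adam(params, lr=1e-2)

for step in range(500):
    loss = torch.zeros(())
    for (i, j) in edges:
        R = axis_angle_to_matrix(rot[i])
        # Map the predicted points of frame i into world space and compare
        # them with the shared world point cloud of frame i.
        mapped = scale[(i, j)] * (pred[(i, j)] @ R.T + trans[i])
        loss = loss + (world[i] - mapped).abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The consistent camera poses and depth maps reported below fall out of this kind of joint optimization: once all pairwise point maps agree in a single world frame, per-frame depth follows by projecting each frame's world points back through its recovered pose.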

Results on DAVIS dataset

Results on indoor scenes (TUM dynamics and Bonn datasets)

Results on PointOdyssey and FlyingThings3D datasets

Camera pose estimation