DidSee👓: Diffusion-Based Depth Completion for Material-Agnostic Robotic Perception and Manipulation

arXiv 2025

Wenzhou Lyu1, Jialing Lin1, Wenqi Ren1, Ruihao Xia1, Feng Qian1, Yang Tang1
1East China University of Science and Technology
Teaser

We propose DidSee, a diffusion-based depth completion framework for non-Lambertian objects. DidSee reduces depth restoration errors by mitigating signal leakage and exposure biases while incorporating a novel semantic enhancer (Right). Without scaling up the training data, it generalizes robustly to in-the-wild scenes (Left) and facilitates material-agnostic robotic perception and manipulation in real-world scenarios.

Abstract

Commercial RGB-D cameras often produce noisy, incomplete depth maps for non-Lambertian objects. Traditional depth completion methods struggle to generalize due to the limited diversity and scale of training data. Recent advances exploit visual priors from pre-trained text-to-image diffusion models to enhance generalization in dense prediction tasks.

However, we find that biases arising from training-inference mismatches in the vanilla diffusion framework significantly impair depth completion performance. Additionally, the lack of distinct visual features in non-Lambertian regions further hinders precise prediction.

To address these issues, we propose DidSee, a diffusion-based framework for depth completion on non-Lambertian objects. First, we integrate a rescaled noise scheduler enforcing a zero terminal signal-to-noise ratio to eliminate signal leakage bias. Second, we devise a noise-agnostic single-step training formulation to alleviate error accumulation caused by exposure bias and optimize the model with a task-specific loss. Finally, we incorporate a semantic enhancer that enables joint depth completion and semantic segmentation, distinguishing objects from backgrounds and yielding precise, fine-grained depth maps.

DidSee achieves state-of-the-art performance on multiple benchmarks, demonstrates robust real-world generalization, and effectively improves downstream tasks such as category-level pose estimation and robotic grasping.

Methodology

DidSee: Diffusion-Based Depth Completion

During training, the pre-trained VAE encoder $\mathcal{E}$ encodes the image $\mathbf{x}$, raw depth $\mathbf{d}$, and ground-truth depth $\mathbf{y^{d}}$ into latent space, producing $\mathbf{z^x}$, $\mathbf{z^d}$, and $\mathbf{z^y_0}$, respectively. ① The noisy input $\mathbf{z}_t^\mathbf{y}$ is generated with a rescaled noise scheduler that enforces a zero terminal SNR to eliminate signal leakage bias. ② We adopt a noise-agnostic single-step diffusion formulation with a fixed timestep $t=T$ to mitigate the exposure bias that arises during multi-step sampling. In this formulation, the model's prediction $\hat{\mathbf{v}}_T$ equals the estimated latent $\hat{\mathbf{z}}_0^\mathbf{y}$, which is then decoded by the VAE decoder $\mathcal{D}$ into a depth map; this allows us to supervise the denoising model $f_{\theta}$ with a task-specific loss in pixel space. ③ We introduce a novel semantic enhancer that enables the model to jointly perform depth completion and semantic regression, improving object-background distinction and ensuring fine-grained depth prediction. ④ The restored depth maps can be applied to downstream tasks, such as pose estimation and robotic grasping on non-Lambertian objects.
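To make step ① concrete, below is a minimal, self-contained PyTorch sketch of one standard way to enforce a zero terminal SNR: shift and rescale $\sqrt{\bar{\alpha}_t}$ so that its value at $t=T$ is exactly zero, then convert back to a beta schedule. The function name and the scaled-linear base schedule in the example are illustrative assumptions rather than the exact settings used in DidSee.

```python
import torch

def rescale_zero_terminal_snr(betas: torch.Tensor) -> torch.Tensor:
    """Rescale a beta schedule so that the terminal SNR is exactly zero.

    Shifts and scales sqrt(alpha_bar_t) so that sqrt(alpha_bar_T) = 0 while
    keeping sqrt(alpha_bar_1) unchanged, then converts back to betas.
    """
    alphas = 1.0 - betas
    alphas_bar = alphas.cumprod(dim=0)
    alphas_bar_sqrt = alphas_bar.sqrt()

    a_sqrt_first = alphas_bar_sqrt[0].clone()
    a_sqrt_last = alphas_bar_sqrt[-1].clone()

    # Shift so the last timestep carries zero signal, then rescale so the first is unchanged.
    alphas_bar_sqrt -= a_sqrt_last
    alphas_bar_sqrt *= a_sqrt_first / (a_sqrt_first - a_sqrt_last)

    # Convert the rescaled sqrt(alpha_bar) back to a beta schedule.
    alphas_bar = alphas_bar_sqrt ** 2
    alphas = alphas_bar[1:] / alphas_bar[:-1]
    alphas = torch.cat([alphas_bar[:1], alphas])
    return 1.0 - alphas

# Example with an illustrative scaled-linear base schedule over 1000 steps.
betas = torch.linspace(0.00085 ** 0.5, 0.012 ** 0.5, 1000) ** 2
betas_zero_snr = rescale_zero_terminal_snr(betas)
assert torch.isclose((1 - betas_zero_snr).cumprod(0)[-1], torch.tensor(0.0), atol=1e-6)
```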

DidSee Framework
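The single-step objective of ② and ③ can be sketched as follows: encode the inputs, run the denoiser once at the fixed timestep $t=T$ (where, under a zero terminal SNR, the noisy latent carries no signal and the input is therefore noise-agnostic), take the prediction as the estimated clean latent, decode it, and supervise in pixel space together with a semantic term. The code below is a simplified illustration: the module interfaces (`vae`, `denoiser`, `sem_head`) and the L1 losses are placeholder assumptions standing in for the actual architecture and task-specific loss.

```python
import torch
import torch.nn.functional as F

def single_step_train_loss(vae, denoiser, sem_head,
                           image, raw_depth, gt_depth, gt_sem, T=999):
    """One training step of the noise-agnostic single-step formulation (sketch).

    vae      : frozen pre-trained VAE exposing encode()/decode()
    denoiser : UNet f_theta over concatenated latents, conditioned on a timestep
    sem_head : hypothetical auxiliary decoder standing in for the semantic enhancer
    """
    with torch.no_grad():
        z_x = vae.encode(image)        # image latent z^x
        z_d = vae.encode(raw_depth)    # raw-depth latent z^d

    # Zero terminal SNR: at t = T the noisy latent z_T^y contains no signal from
    # z_0^y, so the denoiser input is noise-agnostic; a fixed input is used here.
    z_T = torch.zeros_like(z_x)
    t = torch.full((image.shape[0],), T, device=image.device, dtype=torch.long)

    # Single forward pass: under zero terminal SNR and v-parameterization, the
    # prediction at t = T plays the role of the estimated clean latent z_0^y
    # (up to the sign convention of the v-target).
    z0_hat = denoiser(torch.cat([z_x, z_d, z_T], dim=1), t)

    depth_pred = vae.decode(z0_hat)           # decode the latent into a depth map
    loss = F.l1_loss(depth_pred, gt_depth)    # pixel-space depth term (L1 as a stand-in)

    # Semantic enhancer: jointly regress a semantic map to sharpen object/background distinction.
    sem_pred = sem_head(z0_hat)
    loss = loss + F.l1_loss(sem_pred, gt_sem)
    return loss
```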

Experiments

Quantitative Comparison with Other Methods

Quantitative comparison of DidSee with SoTA methods on several non-Lambertian depth completion benchmarks. The best and second-best results are highlighted. DidSee outperforms all other methods.

Comparison with other methods

Qualitative Comparison with Other Methods

Qualitative comparison of DidSee with SoTA methods on the STD dataset. DidSee generates fewer artifacts, sharper boundaries, and more precise scene reconstructions.

Comparison with other methods

Material-Agnostic Scene Reconstruction and Robotic Grasping

More Qualitative Results

In-the-wild Scenes

Comparison with other methods

Please refer to our paper for more technical details :)