StableNormal: Reducing Diffusion Variance for Stable and Sharp Normal

Chongjie Ye^*,1,2, Lingteng Qiu^*,1,2,3,

Xiaodong Gu³, Qi Zuo³, Yushuang Wu^1,2,

Zilong Dong³, Liefeng Bo³,

Yuliang Xiu^†,4, Xiaoguang Han^†,2,1

¹FNii, The Chinese University of Hong Kong, Shenzhen, China
²SSE, The Chinese University of Hong Kong, Shenzhen, China
³Alibaba Group, China
⁴Max Planck Institute for Intelligent Systems, Germany
^*Equal contribution, ^†Corresponding Author

SIGGRAPH Asia 2024 (Journal Track)


This work addresses the challenge of high-quality surface normal estimation from monocular colored inputs (i.e., images and videos), a field that has recently been revolutionized by repurposing diffusion priors. However, these attempts still suffer from high-variance inference, which conflicts with the deterministic nature of the Image2Normal task. Our method, StableNormal, reduces this inference variance and thus produces “stable” and “sharp” normal estimates, even under challenging imaging conditions such as extreme lighting, motion/defocus blur, and low-quality or compressed images. It is also robust to transparent and reflective surfaces, as well as to cluttered scenes with numerous objects. Specifically, StableNormal employs a coarse-to-fine strategy: a one-step normal estimator (YOSO) first establishes a reliable but relatively coarse initial normal, and a semantic-guided refinement process (SG-DRN) then refines it to recover geometric details. The effectiveness of StableNormal is demonstrated through competitive performance on standard datasets such as DIODE-indoor, iBims, ScanNetV2, and NYUv2, and through its ability to enhance downstream tasks such as surface reconstruction and normal enhancement. These results show that StableNormal retains both the “stability” and “sharpness” necessary for accurate normal estimation, making it a meaningful step toward repurposing diffusion priors for deterministic estimation. To democratize this, code and models will be made publicly available.
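To make “inference variance” concrete, one illustrative way to measure it is to run a stochastic estimator several times on the same image with different random seeds and report how far, on average, each prediction deviates in angle from the per-pixel mean normal. The sketch below assumes this metric for illustration only; it is not the paper's evaluation protocol, and `toy_estimator` is a hypothetical stand-in that adds seed-dependent noise to a flat normal field rather than running a real diffusion model.

```python
import numpy as np


def normalize(n, eps=1e-8):
    """Rescale vectors along the last axis to unit length."""
    return n / np.maximum(np.linalg.norm(n, axis=-1, keepdims=True), eps)


def toy_estimator(image, seed, noise_scale):
    """Hypothetical stand-in for a diffusion-based normal estimator.

    Returns a flat, camera-facing normal field perturbed by seed-dependent
    noise; noise_scale=0.0 mimics a fully deterministic predictor.
    """
    h, w = image.shape[:2]
    base = np.zeros((h, w, 3))
    base[..., 2] = 1.0  # every pixel faces the camera (+z)
    rng = np.random.default_rng(seed)
    return normalize(base + noise_scale * rng.standard_normal((h, w, 3)))


def angular_spread_deg(samples):
    """Mean angle (degrees) between each sample and the per-pixel mean normal.

    samples: array of shape (S, H, W, 3) holding S unit-normal predictions
    of the same image, each obtained with a different random seed.
    """
    mean = normalize(samples.mean(axis=0))                   # (H, W, 3)
    cos = np.clip((samples * mean).sum(axis=-1), -1.0, 1.0)  # (S, H, W)
    return float(np.degrees(np.arccos(cos)).mean())


image = np.zeros((8, 8, 3))
for scale in (0.0, 0.3):
    preds = np.stack([toy_estimator(image, s, scale) for s in range(10)])
    print(f"noise_scale={scale}: spread = {angular_spread_deg(preds):.2f} deg")
```

The deterministic case reports a spread of 0.00 deg, while the noisy case reports a positive spread; under this assumed metric, a lower spread corresponds to the “stable” behavior the method targets.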

Pipeline

The overall pipeline is composed of two stages: 1) YOSO produces a confident initialization x_t using a novel Noise-Decoupled MSE Loss; 2) SG-DRN performs stable denoising by leveraging stronger semantic control information extracted from DINO [Oquab et al. 2024]. The textual prompt for the U-Net in both stages is set to “The normal map”.
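The control flow of this two-stage, coarse-to-fine design can be sketched as follows. This is a minimal sketch under loudly stated assumptions: every function name (`yoso_one_step`, `semantic_features`, `refine_step`, `stable_normal_sketch`), the timestep schedule, and the blending rule are hypothetical. The real stages are diffusion U-Nets, the semantic features come from DINO rather than image intensity, and the Noise-Decoupled MSE Loss used for YOSO is not modeled. For readability, the sketch operates directly on unit normal vectors.

```python
import numpy as np

PROMPT = "The normal map"  # text prompt for the U-Net in both stages (from the text)


def normalize(n, eps=1e-8):
    """Rescale vectors along the last axis to unit length."""
    return n / np.maximum(np.linalg.norm(n, axis=-1, keepdims=True), eps)


# ---- Hypothetical stand-ins for the learned networks (NOT the released API) ----
def yoso_one_step(image, prompt):
    """Stage 1 stub: one forward pass that returns a coarse normal field."""
    h, w = image.shape[:2]
    coarse = np.zeros((h, w, 3))
    coarse[..., 2] = 1.0  # placeholder: every pixel faces the camera
    return coarse


def semantic_features(image):
    """Stub for DINO-style semantic features (here just per-pixel intensity)."""
    return image.mean(axis=-1, keepdims=True)


def refine_step(x, t, feats, prompt):
    """Stage 2 stub: pull the estimate toward a feature-dependent target.

    The blend weight grows as t decreases, so later steps add more detail.
    """
    target = normalize(np.concatenate(
        [feats - 0.5, np.zeros_like(feats), np.ones_like(feats)], axis=-1))
    alpha = 1.0 - t
    return normalize((1.0 - alpha) * x + alpha * target)


def stable_normal_sketch(image, timesteps=(0.75, 0.5, 0.25, 0.0)):
    """Coarse-to-fine: one-step initialization, then guided refinement."""
    x = normalize(yoso_one_step(image, PROMPT))  # reliable but coarse init
    feats = semantic_features(image)             # semantic guidance, computed once
    for t in timesteps:                          # few steps, coarse -> fine
        x = refine_step(x, t, feats, PROMPT)
    return x


img = np.random.default_rng(0).random((6, 6, 3))
normals = stable_normal_sketch(img)
print(normals.shape)  # (6, 6, 3)
```

Because stage 1 already supplies a confident starting point, stage 2 in this sketch needs only a handful of refinement steps. The sketch is deterministic by construction (the same input always yields the same output); the actual method aims to reduce, rather than eliminate, sampling variance.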
