This work addresses the challenge of high-quality surface normal estimation from monocular colored inputs (i.e., images and videos), a field that has recently been revolutionized by repurposing diffusion priors. However, these attempts still struggle with high-variance inference, which conflicts with the deterministic nature of the Image2Normal task. Our method, StableNormal,
aims to reduce the inference variance, thus producing “stable” and “sharp”
normal estimates, even under challenging imaging conditions, such as extreme lighting, motion/defocus blur, and low-quality/compressed images. It
is also robust against transparent and reflective surfaces, as well as cluttered
scenes with numerous objects. Specifically, StableNormal employs a coarse-to-fine strategy: a one-step normal estimator (YOSO) first establishes a reliable but relatively coarse initial normal, which a semantic-guided refinement process (SG-DRN) then refines to recover geometric details.
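To make the two-stage design concrete, the following is a minimal sketch of the coarse-to-fine pipeline in PyTorch. The names `yoso`, `sg_drn`, and `num_refine_steps` are illustrative assumptions, not the released interface; this is a sketch of the idea under those assumptions, not the actual implementation.

```python
import torch
import torch.nn.functional as F

def stable_normal(image: torch.Tensor,
                  yoso,                      # hypothetical one-step coarse estimator
                  sg_drn,                    # hypothetical semantic-guided refiner
                  num_refine_steps: int = 10) -> torch.Tensor:
    """Coarse-to-fine normal estimation: YOSO initialization + SG-DRN refinement.

    `yoso` and `sg_drn` are stand-ins for the paper's two networks; the real
    interfaces may differ.
    """
    # Stage 1: a single deterministic step yields a reliable but coarse normal
    # map, avoiding the high variance of sampling from pure noise.
    normals = yoso(image)

    # Stage 2: iteratively refine the coarse estimate, conditioned on the
    # input image, to recover sharp geometric details.
    for _ in range(num_refine_steps):
        normals = sg_drn(normals, image)

    # Re-normalize to unit length so the output is a valid normal map.
    return F.normalize(normals, dim=1)
```

The key design choice reflected here is the starting point: initializing refinement from a deterministic one-step estimate, rather than from pure noise, is what suppresses the inference variance of naive diffusion-based estimators.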
The effectiveness of StableNormal is demonstrated through competitive performance on standard benchmarks such as DIODE-indoor, iBims, ScanNetV2, and NYUv2, and through its ability to enhance various downstream tasks, such as surface reconstruction and normal enhancement. These results show that StableNormal retains both the “stability” and “sharpness” required for accurate normal estimation. StableNormal thus marks a meaningful step toward repurposing diffusion priors for deterministic estimation. To democratize this, our code and models will be made publicly available.