BokehDepth: Enhancing Monocular Depth Estimation through Bokeh Generation
Abstract
Bokeh and monocular depth estimation are tightly coupled through the same lens imaging geometry, yet current methods exploit this connection only partially. High-quality bokeh rendering pipelines typically depend on noisy depth maps, which amplify estimation errors into visible artifacts, while modern monocular metric depth models still struggle in weakly textured, distant, and geometrically ambiguous regions where defocus cues are most informative. We introduce BokehDepth, a two-stage framework that decouples bokeh synthesis from depth prediction and treats defocus as an auxiliary, supervision-free geometric cue. In Stage-1, a physically guided, controllable bokeh generator, built on a powerful pretrained image editing backbone, produces depth-free bokeh stacks with calibrated bokeh strength from a single sharp input. In Stage-2, a lightweight defocus-aware aggregation module plugs into existing monocular depth encoders, fuses features along the defocus dimension, and exposes stable depth-sensitive variations while leaving the downstream decoder unchanged. Across challenging benchmarks, BokehDepth improves visual fidelity over depth-map-based bokeh baselines and consistently boosts the metric accuracy and robustness of strong monocular depth foundation models.
Overview
From monocular depth and depth-based bokeh to BokehDepth. (a) Standard monocular depth estimation predicts a depth map from a single RGB image. (b) Classical bokeh rendering takes an image and its depth map as input to synthesize bokeh. (c) BokehDepth first generates a calibrated bokeh stack from the input image, and then uses the induced defocus cues to enhance depth estimation.
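As a concrete illustration of pipeline (c), the following is a minimal, hypothetical Python sketch of the two-stage inference flow. The component callables are passed in as arguments and the strength values are illustrative assumptions, not the released API.

def bokehdepth_predict(image, bokeh_generator, depth_encoder, fuse_defocus, dpt_decoder,
                       strengths=(0.25, 0.5, 1.0)):
    # Stage-1: render a calibrated bokeh stack from the single sharp image;
    # unlike classical bokeh rendering, no depth map is required as input.
    stack = [bokeh_generator(image, k) for k in strengths]

    # Stage-2: encode the sharp image together with the stack, fuse features
    # along the defocus axis, and decode metric depth with the unchanged decoder.
    features = [depth_encoder(frame) for frame in (image, *stack)]
    fused = fuse_defocus(features, strengths=(0.0, *strengths))
    return dpt_decoder(fused)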
Method
BokehDepth architecture. (a) Stage-1 bokeh generation augments a pretrained image-to-image (I2I) model, such as FLUX-Kontext, with a bokeh cross-attention adapter that takes a scalar bokeh strength K and produces a calibrated multi-strength bokeh stack from a single sharp image. (b) Stage-2 bokeh stack fusion inserts Divided Space Focus (DSF) Attention into a ViT encoder, uses FiLM conditioning to inject the bokeh stack along the defocus axis, and then feeds the aggregated layer-wise features to an unchanged DPT decoder to predict metric depth.
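Below is a minimal PyTorch sketch of the Stage-2 fusion idea in (b), assuming a token layout of (batch, stack frames, spatial tokens, channels). The module name DefocusAxisFusion, the strength embedding, and the mean aggregation over the stack are illustrative assumptions standing in for the paper's DSF Attention, not its exact implementation.

import torch
import torch.nn as nn


class DefocusAxisFusion(nn.Module):
    """FiLM conditioning on bokeh strength plus attention along the defocus axis."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # FiLM generator: maps a scalar bokeh strength K to per-channel scale and shift.
        self.film = nn.Sequential(nn.Linear(1, dim), nn.SiLU(), nn.Linear(dim, 2 * dim))
        # Attention applied along the stack (defocus) axis only: each spatial token
        # attends to its counterparts in the other bokeh-stack frames.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor, strengths: torch.Tensor) -> torch.Tensor:
        # tokens:    (B, S, N, C)  batch, stack frames, spatial tokens, channels
        # strengths: (B, S)        bokeh strength K of each frame
        B, S, N, C = tokens.shape
        scale, shift = self.film(strengths.unsqueeze(-1)).chunk(2, dim=-1)  # (B, S, C) each
        x = tokens * (1 + scale.unsqueeze(2)) + shift.unsqueeze(2)          # FiLM conditioning

        # Move the defocus axis into the sequence dimension: (B*N, S, C).
        x = x.permute(0, 2, 1, 3).reshape(B * N, S, C)
        h = self.norm(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]

        # Average over the stack so the unchanged DPT decoder sees one (B, N, C) feature map.
        return x.reshape(B, N, S, C).mean(dim=2)

In this sketch the fused (B, N, C) tokens would simply replace the corresponding layer's tokens before the next ViT block, matching the caption's description of feeding aggregated layer-wise features to an unchanged decoder.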
Results
Qualitative results of BokehDepth built on Depth Anything V2. From top to bottom: the input image, three representative frames from the Stage-1 bokeh stack, the Stage-2 depth prediction, the error map of BokehDepth, the Depth Anything V2 prediction, the corresponding error map, the ground-truth depth, the ΔError map that reports the per-pixel reduction in absolute depth error of BokehDepth over the base model, and the RGB image overlaid with green regions that mark where our method produces notable improvements. BokehDepth lowers depth errors on fine structures, weakly textured walls, and distant background regions, offering more distinct layer separation and steadier metric depth across varied scenes.
BibTeX
@article{zhang2025bokehdepth,
  title={BokehDepth: Enhancing Monocular Depth Estimation through Bokeh Generation},
  author={Hangwei Zhang and Armando Teles Fortes and Tianyi Wei and Xingang Pan},
  journal={arXiv preprint arXiv:2512.12425},
  year={2025}
}