Unsupervised monocular depth estimation frameworks have shown promising performance in autonomous driving. However, existing solutions primarily rely on a simple convolutional neural network for ego-motion recovery, which struggles to estimate precise camera poses in dynamic, complex real-world scenarios. Inaccurately estimated camera poses inevitably deteriorate the photometric reconstruction and mislead the depth estimation network with incorrect supervisory signals. In this article, we introduce SCIPaD, a novel approach that incorporates spatial clues into depth-pose joint learning. Specifically, a confidence-aware feature flow estimator is proposed to acquire 2D feature positional translations. Meanwhile, we introduce a positional clue aggregator, which integrates pseudo 3D point clouds from DepthNet and 2D feature flows into homogeneous positional representations. Finally, a hierarchical positional embedding injector is proposed to selectively inject spatial clues into semantic features for robust camera pose decoding. Extensive experiments and analyses demonstrate that our model outperforms other state-of-the-art methods. Our source code is available at mias.group/SCIPaD.
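The positional clue aggregation described above can be sketched in a few lines of NumPy: a depth map is back-projected through the camera intrinsics into a pseudo 3D point cloud, which is then concatenated with the 2D feature flow into a single positional representation. This is a minimal illustrative sketch, not the released implementation; the function names and the simple channel-wise concatenation are assumptions for exposition.

```python
import numpy as np

def backproject_depth(depth, K):
    """Back-project a depth map into a pseudo 3D point cloud.

    depth: (H, W) depth map; K: (3, 3) pinhole intrinsic matrix.
    Returns an (H, W, 3) array of camera-frame 3D points.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))          # pixel coordinate grid
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)     # homogeneous pixel coords
    rays = pixels.astype(np.float64) @ np.linalg.inv(K).T   # normalized camera rays
    return rays * depth[..., None]                          # scale each ray by its depth

def aggregate_positional_clues(depth, flow, K):
    """Fuse pseudo 3D points and 2D feature flow into one
    homogeneous (H, W, 5) positional representation
    (a simplified stand-in for the positional clue aggregator)."""
    points = backproject_depth(depth, K)
    return np.concatenate([points, flow], axis=-1)
```

For example, with identity intrinsics and unit depth, the back-projected point at pixel (u, v) is simply (u, v, 1), and appending a 2-channel flow yields a 5-channel map per pixel.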
An illustration of our proposed SCIPaD framework. Compared with the traditional PoseNet architecture, it comprises three main parts: (1) a confidence-aware feature flow estimator, (2) a positional clue aggregator, and (3) a hierarchical positional embedding injector.
An illustration of our proposed confidence-aware feature flow estimator. It takes batch-wise separated features as input and produces feature flows.
@article{feng2024scipad,
title={SCIPaD: Incorporating Spatial Clues into Unsupervised Pose-Depth Joint Learning},
author={Feng, Yi and Guo, Zizhan and Chen, Qijun and Fan, Rui},
journal={IEEE Transactions on Intelligent Vehicles},
year={2024},
publisher={IEEE}
}