Unsupervised monocular depth estimation frameworks have shown promising performance in autonomous driving. However, existing solutions primarily rely on a simple convolutional neural network for ego-motion recovery, which struggles to estimate precise camera poses in dynamic, complex real-world scenarios. Inaccurately estimated camera poses inevitably deteriorate the photometric reconstruction and mislead the depth estimation network with incorrect supervisory signals. In this article, we introduce SCIPaD, a novel approach that incorporates spatial clues into depth-pose joint learning. Specifically, a confidence-aware feature flow estimator is proposed to acquire 2D feature positional translations. Meanwhile, we introduce a positional clue aggregator, which integrates pseudo 3D point clouds from DepthNet and 2D feature flows into homogeneous positional representations. Finally, a hierarchical positional embedding injector is proposed to selectively inject spatial clues into semantic features for robust camera pose decoding. Extensive experiments and analyses demonstrate that our model outperforms other state-of-the-art methods. Our source code is available at mias.group/SCIPaD.
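The positional clue aggregation described above can be sketched in a few lines of NumPy: a depth map is back-projected through the camera intrinsics into a pseudo 3D point cloud, which is then concatenated with the 2D feature flow into a single positional representation. This is a minimal illustrative sketch, not the released implementation; the function names and the simple channel-wise concatenation are assumptions for exposition.

```python
import numpy as np

def backproject_depth(depth, K):
    """Back-project a depth map into a pseudo 3D point cloud.

    depth: (H, W) depth map; K: (3, 3) pinhole intrinsic matrix.
    Returns an (H, W, 3) array of camera-frame 3D points.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))          # pixel coordinate grid
    pixels = np.stack([u, v, np.ones_like(u)], axis=-1)     # homogeneous pixel coords
    rays = pixels.astype(np.float64) @ np.linalg.inv(K).T   # normalized camera rays
    return rays * depth[..., None]                          # scale each ray by its depth

def aggregate_positional_clues(depth, flow, K):
    """Fuse pseudo 3D points and 2D feature flow into one
    homogeneous (H, W, 5) positional representation
    (a simplified stand-in for the positional clue aggregator)."""
    points = backproject_depth(depth, K)
    return np.concatenate([points, flow], axis=-1)
```

For example, with identity intrinsics and unit depth, the back-projected point at pixel (u, v) is simply (u, v, 1), and appending a 2-channel flow yields a 5-channel map per pixel.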
An illustration of our proposed SCIPaD framework. Compared with the traditional PoseNet architecture, it comprises three main parts: (1) a confidence-aware feature flow estimator, (2) a positional clue aggregator, and (3) a hierarchical positional embedding injector.
An illustration of our proposed confidence-aware feature flow estimator. It takes batch-wise separated features as input and produces feature flows.
@article{feng2024scipad,
title={SCIPaD: Incorporating Spatial Clues into Unsupervised Pose-Depth Joint Learning},
author={Feng, Yi and Guo, Zizhan and Chen, Qijun and Fan, Rui},
journal={IEEE Transactions on Intelligent Vehicles},
year={2024},
publisher={IEEE}
}