RoadFormer+: Delivering RGB-X Scene Parsing through Scale-Aware Information Decoupling and Advanced Heterogeneous Feature Fusion

Jianxin Huang

Jiahang Li

Ning Jia

Yuxiang Sun

Chengju Liu

Qijun Chen

Rui Fan

[Paper]

[GitHub]

The code can be found in this repository.

Abstract

Task-specific data-fusion networks have achieved considerable success in urban scene parsing. Among these networks, our recently proposed RoadFormer successfully extracts heterogeneous features from RGB images and surface normal maps and fuses these features through attention mechanisms, demonstrating compelling efficacy in RGB-Normal road scene parsing. However, its performance deteriorates significantly when handling other types/sources of data or performing more universal, all-category scene parsing tasks. To overcome these limitations, this study introduces RoadFormer+, an efficient, robust, and adaptable model capable of effectively fusing RGB-X data, where "X" represents additional types/modalities of data such as depth, thermal, surface normal, and polarization. Specifically, we propose a novel hybrid feature decoupling encoder to extract heterogeneous features and decouple them into global and local components. These decoupled features are then fused through a dual-branch multi-scale heterogeneous feature fusion block, which employs parallel Transformer attentions and convolutional neural network modules to merge features across different scales and receptive fields. The fused features are subsequently fed into a decoder to generate the final semantic predictions. Notably, our proposed RoadFormer+ ranks first on the KITTI Road benchmark and achieves state-of-the-art performance in mean intersection over union on the Cityscapes, MFNet, FMB, and ZJU datasets. Moreover, it reduces the number of learnable parameters by 65% compared to RoadFormer.
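For readers unfamiliar with the metric quoted above, mean intersection over union (mIoU) can be illustrated with a minimal sketch. This is not the evaluation code from the repository; the function names, the flattened label lists, and the ignore label of 255 (a common convention for unlabeled pixels) are assumptions made for illustration only.

```python
from typing import List, Sequence


def confusion_matrix(pred: Sequence[int], target: Sequence[int],
                     num_classes: int, ignore_index: int = 255) -> List[List[int]]:
    """Count pixels per (ground-truth class, predicted class) pair.

    Rows index the ground-truth class, columns the predicted class.
    Pixels whose ground truth equals ignore_index (an assumed convention)
    are skipped.
    """
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue
        cm[t][p] += 1
    return cm


def mean_iou(cm: List[List[int]]) -> float:
    """Average the per-class IoU = TP / (TP + FP + FN) over classes that occur."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # class c pixels predicted as another class
        fp = sum(cm[r][c] for r in range(n)) - tp   # other-class pixels predicted as c
        denom = tp + fp + fn
        if denom == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example: six flattened pixels, three classes, last pixel unlabeled.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
cm = confusion_matrix(pred, target, num_classes=3)
miou = mean_iou(cm)
```

In this toy example the per-class IoUs are 1/2, 2/3, and 1, so the mean is 13/18. Benchmarks differ in how they treat ignored labels and classes that never appear, so each dataset's official evaluation protocol should be consulted for reported numbers.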

Paper and Supplementary Material

J. Huang, J. Li, N. Jia, Y. Sun, C. Liu, Q. Chen, R. Fan.
RoadFormer+: Delivering RGB-X Scene Parsing through Scale-Aware Information Decoupling and Advanced Heterogeneous Feature Fusion
(hosted on arXiv)

[Bibtex]

Acknowledgements

This research was supported by the National Natural Science Foundation of China under Grants 62233013, 62333017, and 62173248, the Science and Technology Commission of Shanghai Municipality under Grant 22511104500, the Hong Kong Research Grants Council under Grant 15222523, the Fundamental Research Funds for the Central Universities, and the Xiaomi Young Talents Program. (Jianxin Huang and Jiahang Li are joint first authors.) (Corresponding author: Rui Fan.)