The recent advancements in deep convolutional neural networks have shown significant promise in the domain
of road scene parsing. Nevertheless, existing works focus primarily on freespace detection, with little
attention given to hazardous road defects that could compromise both driving safety and comfort. In this paper,
we introduce RoadFormer, a novel Transformer-based data-fusion network developed for road scene parsing. RoadFormer
utilizes a duplex encoder architecture to extract heterogeneous features from both RGB images and surface normal
information. The encoded features are subsequently fed into a novel heterogeneous feature synergy block for effective
feature fusion and recalibration. The pixel decoder then learns multi-scale long-range dependencies from the fused
and recalibrated heterogeneous features, which are then processed by a Transformer decoder to produce the final semantic
prediction. Additionally, we release SYN-UDTIRI, the first large-scale road scene parsing dataset that includes
over 10,407 RGB images, dense depth images, and the corresponding pixel-level annotations for both freespace and
road defects of different shapes and sizes. Extensive experimental evaluations conducted on our SYN-UDTIRI dataset,
as well as on three public datasets, including KITTI road, CityScapes, and ORFD, demonstrate that RoadFormer
outperforms all other state-of-the-art networks for road scene parsing. Specifically, RoadFormer ranks first
on the KITTI road benchmark. Our source code, created dataset, and demo video are publicly available at
https://mias.group/RoadFormer/.
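
RoadFormer takes surface normal information as its second input modality, and SYN-UDTIRI provides dense depth images. To illustrate how the two can relate, the following minimal Python sketch estimates per-pixel unit normals from a depth grid using finite differences. It is not the estimator used in this work: the function name `normals_from_depth` is hypothetical, and the sketch assumes unit pixel spacing and ignores camera intrinsics.

```python
import math


def normals_from_depth(depth):
    """Estimate per-pixel unit normals from a 2D depth grid (list of rows).

    Assumption: neighboring pixels are one unit apart in x and y, so no
    camera intrinsics are used. This is illustrative only.
    """
    h, w = len(depth), len(depth[0])
    normals = []
    for v in range(h):
        row = []
        for u in range(w):
            # Clamp neighbor indices so borders fall back to one-sided differences.
            u0, u1 = max(u - 1, 0), min(u + 1, w - 1)
            v0, v1 = max(v - 1, 0), min(v + 1, h - 1)
            dz_dx = (depth[v][u1] - depth[v][u0]) / max(u1 - u0, 1)
            dz_dy = (depth[v1][u] - depth[v0][u]) / max(v1 - v0, 1)
            # A surface z = f(x, y) has the unnormalized normal (-df/dx, -df/dy, 1).
            nx, ny, nz = -dz_dx, -dz_dy, 1.0
            norm = math.sqrt(nx * nx + ny * ny + nz * nz)
            row.append((nx / norm, ny / norm, nz / norm))
        normals.append(row)
    return normals
```

On a constant-depth grid every normal points along the z axis, while a grid whose depth grows linearly along x yields normals tilted against the x direction. A three-channel normal map of this kind can be passed to an image encoder in the same way as an RGB image.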