Cross-Modality Attention and Multimodal Fusion Transformer for Pedestrian Detection
Citation:
W.-Y. Lee, L. Jovanov, and W. Philips, "Cross-Modality Attention and Multimodal Fusion Transformer for Pedestrian Detection," in Computer Vision – ECCV 2022 Workshops, L. Karlinsky, T. Michaeli, and K. Nishino, Eds. Tel Aviv, Israel: Springer Nature Switzerland, 2023, pp. 608–623, doi: 10.1007/978-3-031-25072-9_41.
BibTeX Entry:
@InProceedings{Weiyu2022ECCVW,
  author      = {Lee, Wei-Yu and Jovanov, Ljubomir and Philips, Wilfried},
  title       = {Cross-Modality Attention and Multimodal Fusion Transformer for Pedestrian Detection},
  booktitle   = {Computer Vision -- ECCV 2022 Workshops},
  year        = {2023},
  editor      = {Karlinsky, Leonid and Michaeli, Tomer and Nishino, Ko},
  pages       = {608--623},
  address     = {Tel Aviv, Israel},
  month       = oct,
  publisher   = {Springer Nature Switzerland},
  abstract    = {Pedestrian detection is an important challenge in computer vision due to its various applications. To achieve more accurate results, thermal images have been widely exploited as complementary information to assist conventional RGB-based detection. Although existing methods have developed numerous fusion strategies to utilize the complementary features, research that focuses on exploring features exclusive to each modality is limited. On this account, the features specific to one modality cannot be fully utilized and the fusion results could be easily dominated by the other modality, which limits the upper bound of discrimination ability. Hence, we propose the Cross-modality Attention Transformer (CAT) to explore the potential of modality-specific features. Further, we introduce the Multimodal Fusion Transformer (MFT) to identify the correlations between the modality data and perform feature fusion. In addition, a content-aware objective function is proposed to learn better feature representations. The experiments show that our method can achieve state-of-the-art detection performance on public datasets. The ablation studies also show the effectiveness of the proposed components.},
  doi         = {10.1007/978-3-031-25072-9_41},
  isbn        = {978-3-031-25072-9},
  url         = {https://link.springer.com/chapter/10.1007/978-3-031-25072-9_41},
}
Abstract:
Pedestrian detection is an important challenge in computer vision due to its various applications. To achieve more accurate results, thermal images have been widely exploited as complementary information to assist conventional RGB-based detection. Although existing methods have developed numerous fusion strategies to utilize the complementary features, research that focuses on exploring features exclusive to each modality is limited. On this account, the features specific to one modality cannot be fully utilized and the fusion results could be easily dominated by the other modality, which limits the upper bound of discrimination ability. Hence, we propose the Cross-modality Attention Transformer (CAT) to explore the potential of modality-specific features. Further, we introduce the Multimodal Fusion Transformer (MFT) to identify the correlations between the modality data and perform feature fusion. In addition, a content-aware objective function is proposed to learn better feature representations. The experiments show that our method can achieve state-of-the-art detection performance on public datasets. The ablation studies also show the effectiveness of the proposed components.
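Illustrative sketch:
The abstract's core mechanism, cross-modality attention, lets each modality's queries attend to the other modality's keys and values so complementary evidence is borrowed without one modality dominating the fused features. The module below is a minimal, hypothetical PyTorch sketch of that idea only; it is not the authors' CAT/MFT implementation, and the class name, dimensions, and the residual/normalization choices are illustrative assumptions.

# Hypothetical sketch of bidirectional cross-modality attention between
# RGB and thermal token sequences. Not the paper's actual code.
import torch
import torch.nn as nn


class CrossModalityAttention(nn.Module):
    """Each modality queries the other, then keeps a residual of its own features."""

    def __init__(self, dim: int = 256, num_heads: int = 8):
        super().__init__()
        # One attention block per direction: RGB->thermal and thermal->RGB.
        self.rgb_from_thermal = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.thermal_from_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(dim)
        self.norm_thermal = nn.LayerNorm(dim)

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor):
        # rgb, thermal: (batch, tokens, dim) flattened feature maps.
        # Queries come from one modality; keys/values from the other.
        rgb_attn, _ = self.rgb_from_thermal(query=rgb, key=thermal, value=thermal)
        th_attn, _ = self.thermal_from_rgb(query=thermal, key=rgb, value=rgb)
        # Residual connections preserve each stream's modality-specific features.
        rgb = self.norm_rgb(rgb + rgb_attn)
        thermal = self.norm_thermal(thermal + th_attn)
        return rgb, thermal


if __name__ == "__main__":
    # Toy usage: two 14x14 feature maps flattened to 196 tokens each.
    rgb = torch.randn(2, 196, 256)
    thermal = torch.randn(2, 196, 256)
    block = CrossModalityAttention(dim=256, num_heads=8)
    rgb_out, th_out = block(rgb, thermal)
    print(rgb_out.shape, th_out.shape)  # torch.Size([2, 196, 256]) twice

Keeping two separate output streams, rather than merging immediately, mirrors the paper's motivation that modality-specific features should be preserved before any fusion step.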
Import into reference manager:
If your reference manager supports direct import (e.g., Bookends, Mendeley, or Zotero), this entry can be imported straight from this webpage. Please refer to your reference manager's documentation if you need help saving or importing references from websites.