Kockwelp, Jacqueline; Beckmann, Daniel; Risse, Benjamin
Research article in online collection (conference) | Peer reviewed

Human attention plays a crucial role in visual perception and decision-making, opening new possibilities for integration with machine learning models. While Transformer models excel at modeling global relationships via self-attention, understanding the importance of specific image regions for their decision-making remains challenging. This paper investigates the intersection of human gaze and Transformer-based attention in the context of object classification tasks, focusing on how gaze-prioritized regions correspond to Transformer attention. We extend the analysis of the attention mechanism during inference by focusing the attention of pretrained Vision Transformers on regions of interest derived directly from human gaze. Our findings indicate that gaze-based token masking not only reduces the number of tokens necessary for robust model performance but, in certain configurations, can also improve classification accuracy compared to using the whole image. Even though this masking can improve model performance, we show that the two attention mechanisms exhibit clear structural differences for natural images. Our results shed light on the relationship between human and Transformer attention, providing novel perspectives for optimising Transformer models to achieve more efficient and interpretable image understanding and classification. Code is available at https://zivgitlab.uni-muenster.de/cvmls/gaze-based-token-masking.
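The gaze-based token masking described in the abstract can be illustrated with a minimal sketch: a gaze fixation map is pooled over the Vision Transformer's patch grid, and only the most gaze-salient patch tokens are retained for inference. This is a hypothetical simplification assuming a standard non-overlapping patch grid; the function name `gaze_token_mask`, the `keep_ratio` parameter, and the simple mean-pooling of saliency are illustrative assumptions, not the authors' exact method.

```python
import numpy as np

def gaze_token_mask(gaze_map, patch_size=16, keep_ratio=0.25):
    """Return indices of the patch tokens with the highest mean gaze saliency.

    gaze_map: 2D array (H, W) of gaze fixation density (e.g. a blurred
    fixation heatmap). This is an illustrative sketch, not the paper's
    exact masking procedure.
    """
    H, W = gaze_map.shape
    gh, gw = H // patch_size, W // patch_size
    # Mean saliency per non-overlapping patch on the ViT grid.
    patches = gaze_map[: gh * patch_size, : gw * patch_size].reshape(
        gh, patch_size, gw, patch_size
    ).mean(axis=(1, 3))
    scores = patches.ravel()
    k = max(1, int(keep_ratio * scores.size))
    # Keep the top-k most gaze-salient tokens, highest score first.
    return np.argsort(scores)[::-1][:k]

# Toy example: gaze concentrated in the top-left corner of a 224x224 image.
gaze = np.zeros((224, 224))
gaze[:32, :32] = 1.0
kept = gaze_token_mask(gaze, patch_size=16, keep_ratio=0.1)
```

In a ViT pipeline, the returned indices would be used to gather the corresponding patch embeddings (plus the class token) before the encoder, reducing the token count roughly in proportion to `keep_ratio`.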
Beckmann, Daniel | Professorship for Geoinformatics for Sustainable Development (Prof. Risse)
Kockwelp, Jacqueline | Professorship for Geoinformatics for Sustainable Development (Prof. Risse)
Risse, Benjamin | Professorship for Geoinformatics for Sustainable Development (Prof. Risse)