Poster B163 in Poster Session B - Thursday, August 8, 2024, 1:30 – 3:30 pm, Johnson Ice Rink
Dynamic, social vision highlights gaps between deep learning and human behavior and neural responses
Kathy Garcia1, Emalie McMahon1, Colin Conwell1, Michael F. Bonner1, Leyla Isik1; 1Johns Hopkins University
To date, deep learning models trained for computer vision tasks are the best models of human vision. This work has largely focused on behavioral and neural responses to static images, but the visual world is highly dynamic, and recent work suggests that, in addition to the ventral visual stream specialized for static object recognition, a lateral visual stream processes dynamic, social content. Here, we investigated the ability of 350+ modern image, video, and language models to predict human ratings of the visual and social content of short video clips, as well as neural responses to the same videos. We find that, unlike on prior benchmarks, even the best image-trained models do a poor job of explaining human behavioral judgments and neural responses. Language models outperform vision models in predicting behavior but are less effective at modeling neural responses. In early and mid-level lateral visual regions, video-trained models predicted neural responses far better than image-trained models. However, prediction by all models was lower overall in lateral than in ventral visual regions, particularly in the superior temporal sulcus. Together, these results reveal a key gap in modern deep learning models' ability to match human responses to dynamic visual scenes.
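The abstract does not specify the prediction procedure. A common approach in this literature is cross-validated regularized regression from frozen model embeddings to behavioral ratings or voxel responses; the sketch below is a minimal illustration of that general analysis, not the authors' pipeline, and the arrays `features` (clips x embedding dimensions) and `targets` (clips x rating dimensions or voxels) are hypothetical placeholders.

    # Sketch of a cross-validated encoding analysis from model embeddings
    # to behavioral or neural responses (illustrative only).
    import numpy as np
    from sklearn.linear_model import RidgeCV
    from sklearn.model_selection import KFold

    rng = np.random.default_rng(0)
    features = rng.standard_normal((200, 512))  # 200 clips x 512-dim model embedding (placeholder)
    targets = rng.standard_normal((200, 10))    # 200 clips x 10 rating dimensions or voxels (placeholder)

    scores = []
    for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(features):
        model = RidgeCV(alphas=np.logspace(-2, 4, 13))
        model.fit(features[train_idx], targets[train_idx])
        pred = model.predict(features[test_idx])
        # Pearson correlation between predicted and observed responses, per target dimension
        r = [np.corrcoef(pred[:, j], targets[test_idx, j])[0, 1] for j in range(targets.shape[1])]
        scores.append(r)

    print("mean cross-validated r per dimension:", np.mean(scores, axis=0))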
Keywords: vision, social perception, action recognition, NeuroAI