Poster A116 in Poster Session A - Tuesday, August 6, 2024, 4:15 – 6:15 pm, Johnson Ice Rink

Modeling the Effects of Language on Visual Perception with Deep Learning

Jay Gopal1, Corey Wood1, Drew Linsley1, Pinyuan Feng1, Thomas Serre1; 1Brown University

The modulatory effect of language on visual perception has been demonstrated across multiple domains, but the neural circuit mechanisms governing this interaction remain unclear. Recently, new approaches have been developed that allow deep neural networks (DNNs) to jointly learn vision and language processing. To investigate whether these novel model architectures can help us understand the circuitry linking language and vision, we evaluate how a zoo of DNNs compares to humans in classifying binarized Mooney images. We show that as vision-only and dual-stream language/vision feedback models have improved on ImageNet, they have also become more accurate at Mooney image classification, but they still fail to match human performance. However, we demonstrate that priming a single-stream vision-language DNN with language causes it to perform similarly to humans. Our results suggest that modern vision-language DNNs offer a new opportunity to generate hypotheses about the neural feedback circuits underlying language's ability to modulate visual representations.
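To make the two key manipulations concrete, the sketch below shows one way to binarize an image into a two-tone Mooney image and query a CLIP-style vision-language model with and without a language prime prepended to the text prompts. This is a minimal, hypothetical illustration, not the authors' pipeline: the model checkpoint (openai/clip-vit-base-patch32), the prompt templates, the blur/threshold parameters, and the mooney/classify helpers are all assumptions chosen for clarity.

```python
# Hypothetical sketch of (1) Mooney-style binarization and (2) language priming
# of a CLIP-style vision-language model. Not the authors' code; the checkpoint,
# prompts, and thresholds below are illustrative assumptions.
import torch
from PIL import Image, ImageFilter
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def mooney(img: Image.Image, threshold: int = 128) -> Image.Image:
    """Grayscale, smooth, and threshold an image into a two-tone Mooney image."""
    gray = img.convert("L").filter(ImageFilter.GaussianBlur(radius=4))
    return gray.point(lambda p: 255 if p > threshold else 0).convert("RGB")

def classify(img: Image.Image, labels: list[str], prime: str | None = None) -> str:
    """Zero-shot classification; an optional language prime is prepended to each prompt."""
    prefix = f"{prime}. " if prime else ""
    prompts = [f"{prefix}a photo of a {label}" for label in labels]
    inputs = processor(text=prompts, images=img, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # image-to-text similarity scores
    return labels[logits.argmax().item()]

image = Image.open("dog.jpg")  # hypothetical input image
labels = ["dog", "cat", "car", "chair"]
two_tone = mooney(image)
print("unprimed:", classify(two_tone, labels))
print("primed:  ", classify(two_tone, labels, prime="This is an animal"))
```

Comparing the unprimed and primed outputs on images that the model misclassifies after binarization gives a rough behavioral analog of the priming effect described in the abstract.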

Keywords: multimodality, visual representations, priming, psychophysics