
Contributed Talks I

Talk Session: Wednesday, August 7, 2024, 10:30 – 11:30 am, Kresge Hall

10:30 am

HMAX Strikes Back: Self-supervised Learning of Human-Like Scale Invariant Representations

Ivan Felipe Rodriguez1, Nishka Pant1, Arjun Beniwal2, Scott Warren3, Thomas Serre1; 1Brown University, 2New York University, 3Brown Medical School

Early hierarchical models of the visual cortex, such as HMAX, have now been superseded by modern deep neural networks. Modern deep neural networks optimized for image categorization have been shown to significantly outperform HMAX (and related models) on image categorization tasks and to better fit neural data from the visual cortex, even though they were not explicitly constrained by neuroscience data. However, these earlier hierarchical models were trained with the simpler, local learning rules of the pre-deep-learning era, and they have yet to be updated with modern gradient-based training methods. Here, we describe a novel contrastive learning algorithm for training HMAX (CHMAX) to learn scale-invariant object representations. Unlike standard deep neural networks trained with data augmentation, we show that CHMAX learns visual representations that generalize to novel objects at levels of generalization comparable to human observers. We hope our results will help spur renewed interest in other classic biologically-inspired vision models.
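The abstract does not spell out the CHMAX objective. As a rough illustration of the general idea of contrastive learning across scales, the sketch below shows a generic NT-Xent-style loss in which two rescaled views of the same image form a positive pair; the function name and all details are illustrative assumptions, not the authors' method.

```python
import torch
import torch.nn.functional as F

def multiscale_contrastive_loss(embed_a, embed_b, temperature=0.1):
    """Generic NT-Xent-style loss (illustrative, not the CHMAX objective).

    embed_a, embed_b: (N, D) embeddings of the same N images rendered at two
    different scales. The i-th rows are treated as a positive pair; all other
    images in the batch serve as negatives.
    """
    a = F.normalize(embed_a, dim=1)
    b = F.normalize(embed_b, dim=1)
    logits = a @ b.t() / temperature                 # (N, N) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)
    # Symmetrized cross-entropy: row i and column i should both match image i.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage: stand-in embeddings for a batch of 8 images at two scales.
if __name__ == "__main__":
    za, zb = torch.randn(8, 128), torch.randn(8, 128)
    print(multiscale_contrastive_loss(za, zb).item())
```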

10:42 am

Retinotopy in CNNs Implements Efficient Visual Search

Jean-Nicolas Jeremie1, Emmanuel Daucé2, Laurent Perrinet1; 1Institut de Neurosciences de la Timone, 2Centrale Méditerranée / Institut de Neurosciences de la Timone

While foveated vision, a trait shared by many animals including humans, is a major contributor to biological visual performance, it has been underutilized in machine learning applications. This study investigates whether retinotopic mapping, a critical component of foveated vision, can enhance image categorization and localization performance when integrated into deep convolutional neural networks (CNNs). Retinotopic mapping was used to transform the inputs of standard off-the-shelf CNNs, which were then retrained on the ImageNet task. Surprisingly, the networks with retinotopically-mapped inputs achieved comparable classification performance. Furthermore, the networks demonstrated improved localization when the foveal center of the transform was moved across the image, replicating a crucial ability of the human visual system that is absent in typical CNNs. These findings suggest that retinotopic mapping may be fundamental to important preattentive visual processes; in particular, the retinotopic variants appear to be the better option when applying these networks to a visual search task.
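The abstract does not specify the exact form of the retinotopic transform. A log-polar mapping is a standard model of primate retinotopy, so the sketch below is only an illustrative assumption of how input images might be resampled around a fixation point before being fed to a CNN; the function name and parameters are hypothetical.

```python
import numpy as np
from scipy.ndimage import map_coordinates

def log_polar_retinotopy(image, center, out_shape=(224, 224)):
    """Resample an image onto a log-polar grid centered on a fixation point.

    image: (H, W) or (H, W, C) array; center: (row, col) fixation in pixels.
    Output rows index log-radius (fovea to periphery), columns index angle.
    """
    h, w = image.shape[:2]
    n_r, n_theta = out_shape
    max_radius = np.hypot(max(center[0], h - center[0]),
                          max(center[1], w - center[1]))
    # Log-spaced radii oversample the fovea and compress the periphery.
    radii = np.exp(np.linspace(0.0, np.log(max_radius), n_r)) - 1.0
    thetas = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
    rr = center[0] + radii[:, None] * np.sin(thetas)[None, :]
    cc = center[1] + radii[:, None] * np.cos(thetas)[None, :]
    if image.ndim == 2:
        return map_coordinates(image, [rr, cc], order=1, mode="nearest")
    return np.stack([map_coordinates(image[..., c], [rr, cc], order=1,
                                     mode="nearest")
                     for c in range(image.shape[-1])], axis=-1)

# Toy usage: map a random "image" with fixation at its center.
img = np.random.rand(256, 256, 3)
print(log_polar_retinotopy(img, center=(128, 128)).shape)  # (224, 224, 3)
```

Moving the `center` argument across the image mimics shifting the fixation point, which is the manipulation the abstract describes for localization.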

10:54 am

Dynamic, social vision highlights gaps between deep learning and human behavior and neural responses

Kathy Garcia1, Emalie McMahon1, Colin Conwell1, Michael F. Bonner1, Leyla Isik1; 1Johns Hopkins University

To date, deep learning models trained for computer vision tasks are the best models of human vision. This work has largely focused on behavioral and neural responses to static images, but the visual world is highly dynamic, and recent work has suggested that in addition to the ventral visual stream specializing in static object recognition, there is a lateral visual stream that processes dynamic, social content. Here, we investigated the ability of 350+ modern image, video, and language models to predict human ratings of the visual-social content of short video clips and neural responses to the same videos. We find that, unlike in prior benchmarks, even the best image-trained models do a poor job of explaining human behavioral judgments and neural responses. Language models outperform vision models in predicting behavior but are less effective at modeling neural responses. In early and mid-level lateral visual regions, video-trained models predicted neural responses far better than image-trained models. However, prediction by all models was overall lower in lateral than in ventral visual regions of the brain, particularly in the superior temporal sulcus. Together, these results reveal a key gap in modern deep learning models' ability to match human responses to dynamic visual scenes.

11:06 am

Shared connectome and organization in the human cortex irrespective of sensory experience

Guo Jiahui1, Francesca Setti2, Ma Feilong3, Davide Bottari2, Maria Ida Gobbini4, Pietro Pietrini2, Emiliano Ricciardi2, James V. Haxby3; 1The University of Texas at Dallas, 2IMT School for Advanced Studies Lucca, 3Dartmouth College, 4University of Bologna

To what extent is sensory experience a prerequisite for the development of the functional architecture of the high-level human visual cortex? In this study, congenitally blind and deaf participants were presented with audio-only and video-only versions of the live-action movie 101 Dalmatians. Three control groups of participants watched and/or listened to the audiovisual, audio-only, or video-only versions of the movie. Individualized category-selective topographies were successfully predicted in both congenitally blind and deaf participants using fMRI data from an independent group of participants. Category-selective topographies in the ventral visual pathway in congenitally blind participants were highly comparable to those in sighted participants. Functional connectomes were notably similar across the entire cortex, regardless of the modality of sensory input or the content of the stimuli. This study demonstrates that under real-world conditions, the connectome has a similar organization across varying sensory modalities and content, and shows that development of the functional organization of the human high-level cortex can occur independently of prior sensory experience.

11:18 am

Duality of Bures and Shape Distances with Implications for Comparing Neural Representations

Sarah Harvey1, Brett Larsen1, Alex Williams1,2; 1Flatiron Institute, 2New York University

How should neuroscientists mathematically evaluate whether two individuals or networks have similar neural representations? A multitude of (dis)similarity measures between neural network representations have been proposed, resulting in a fragmented research landscape. Most of these measures fall into one of two categories. First, measures such as linear regression, canonical correlation analysis (CCA), and shape distances all learn explicit mappings between neural units to quantify similarity while accounting for expected invariances. Second, measures such as representational similarity analysis (RSA), centered kernel alignment (CKA), and normalized Bures similarity (NBS) all quantify similarity in summary statistics, such as stimulus-by-stimulus kernel matrices, which are already invariant to expected symmetries. Here, we take steps towards unifying these two broad categories of methods by observing that the cosine of the Riemannian shape distance (from category 1) is equal to NBS (from category 2). We explore how this connection leads to new interpretations of shape distances and NBS, and contrast these measures with CKA, a popular similarity measure in the deep learning literature.
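To make the stated equivalence concrete, the display below is a sketch in notation introduced here rather than taken from the abstract: X and Y are assumed to be mean-centered stimulus-by-unit response matrices with kernels K_X = XX^T and K_Y = YY^T.

```latex
% Sketch of the stated duality under the assumptions above.
\[
  \mathrm{NBS}(K_X, K_Y)
    \;=\; \frac{\operatorname{tr}\!\big[(K_X^{1/2} K_Y K_X^{1/2})^{1/2}\big]}
               {\sqrt{\operatorname{tr}(K_X)\,\operatorname{tr}(K_Y)}}
    \;=\; \frac{\max_{Q^\top Q = I} \operatorname{tr}(X^\top Y Q)}
               {\lVert X \rVert_F \, \lVert Y \rVert_F}
    \;=\; \cos\theta(X, Y),
\]
% where \theta(X, Y) denotes the Riemannian (Procrustes) shape distance, i.e.
% the angle between the two response matrices after optimal orthogonal
% alignment. The middle equality uses the fact that the Bures fidelity between
% K_X and K_Y equals the nuclear norm of X^\top Y.
```

Read this way, NBS inherits the metric interpretation of shape distances, while shape distances inherit the kernel-level, alignment-free computation of NBS.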