
Poster A153 in Poster Session A - Tuesday, August 6, 2024, 4:15 – 6:15 pm, Johnson Ice Rink

CorText: large language models for cross-modal transformations from visually evoked brain responses to text captions.

Victoria Bosch1, Dirk Gütlin2, Adrien Doerig1, Daniel Anthes1, Sushrut Thorat1, Peter König1, Tim C Kietzmann1; 1University of Osnabrück, 2Freie Universität Berlin

An emerging trend in cognitive neuroscience is to investigate neural responses to complex natural scenes. While more ecologically valid, the complexity of these stimuli requires analysis techniques capable of studying not only the neural responses to the object categories that constitute a given scene, but also their rich spatial and semantic interactions. Here, we present a generative brain-to-text decoder, CorText, that produces linguistic descriptions of natural scenes based on visually evoked fMRI responses. At no point does the decoder have access to the visual stimulus; it operates solely on brain data. This cross-modal transformer, consisting of a linear encoder for neural data and a partly frozen pre-trained language decoder, enables us to harness the powerful features of language models to study neural representations. As a proof of concept, we analyse which neural regions are most informative for generating specific words by visualising the transformer's attention patterns. This approach reproduces known functional organisation: elevated attention over the ventral stream and, correspondingly, over cortical regions involved in category-specific processing. This work thus marks an important first step toward using end-to-end generative language transformers to investigate complex neural data.
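To make the described architecture concrete, the following is a minimal sketch of how a linear encoder for fMRI data could be coupled to a frozen pre-trained language decoder. It assumes a PyTorch / HuggingFace stack; the choice of GPT-2 as the decoder, the prefix-token conditioning mechanism, the voxel count, and all other names and dimensions are illustrative assumptions, not the authors' exact implementation.

    # Sketch of a brain-to-text decoder: linear encoder + frozen language model.
    import torch
    import torch.nn as nn
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    class BrainToTextDecoder(nn.Module):
        def __init__(self, n_voxels: int, n_prefix_tokens: int = 8):
            super().__init__()
            self.lm = GPT2LMHeadModel.from_pretrained("gpt2")
            # The abstract describes a *partly* frozen decoder; this sketch
            # freezes all language-model weights for simplicity.
            for p in self.lm.parameters():
                p.requires_grad = False
            d_model = self.lm.config.n_embd
            self.n_prefix = n_prefix_tokens
            # Linear encoder: map a flattened fMRI response to a short prefix
            # of pseudo-token embeddings that condition the language decoder.
            self.encoder = nn.Linear(n_voxels, n_prefix_tokens * d_model)

        def forward(self, fmri, caption_ids):
            batch = fmri.size(0)
            prefix = self.encoder(fmri).view(batch, self.n_prefix, -1)
            tok_emb = self.lm.transformer.wte(caption_ids)
            inputs_embeds = torch.cat([prefix, tok_emb], dim=1)
            # Supervise only the caption tokens; -100 masks the prefix positions.
            labels = torch.cat(
                [torch.full((batch, self.n_prefix), -100, dtype=torch.long,
                            device=caption_ids.device),
                 caption_ids],
                dim=1,
            )
            return self.lm(inputs_embeds=inputs_embeds, labels=labels,
                           output_attentions=True)

    # Usage sketch with fake data: attention weights over the prefix positions
    # indicate which brain-derived inputs influenced each generated word.
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = BrainToTextDecoder(n_voxels=20000)
    fmri = torch.randn(1, 20000)                        # one fake fMRI response
    caption_ids = tokenizer("a dog lying on a couch", return_tensors="pt").input_ids
    out = model(fmri, caption_ids)
    print(out.loss, out.attentions[-1].shape)           # (batch, heads, seq, seq)

In a setup like this, mapping attention over the prefix positions back to the voxels that produced them is one plausible way to ask which cortical regions were most informative for a given generated word, in the spirit of the attention analysis described above.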

Keywords: neural decoding; transformer; scene perception; cross-modal alignment
