Poster B119 in Poster Session B - Thursday, August 8, 2024, 1:30 – 3:30 pm, Johnson Ice Rink
Transformer-based model captures neural representation of audio-visual speech in natural scenes
Wenyuan Yu1, Zhoujian Sun1, Yunyi Qi1, Cheng Luo1; 1Zhejiang Lab
Natural scenes comprise both visual and auditory information, which are integrated to form a coherent perception. The multi-modal encoding of auditory and visual information has been extensively investigated in biological brains and in advanced deep neural networks (DNNs), but little is known about the relationship between the information representations formed by the two. In the current study, we investigate whether humans and DNNs represent auditory and visual information in a comparable way during audio-visual speech recognition (AVSR). For humans, we used electroencephalography (EEG) to analyze the neural encoding of auditory and visual features while participants performed a speech recognition task in audio-visual scenes. For DNNs, we analyzed the hidden-layer embeddings of a transformer-based model, AV-HuBERT, which achieves state-of-the-art performance on AVSR tasks. We observed significant representational similarity between the EEG responses and the model embeddings. Further analysis revealed that embeddings from the lower hidden layers exhibited greater similarity with the neural encoding of visual and auditory features. These results suggest that DNNs can naturally develop human-like information representations, and that their hidden-layer embeddings effectively capture the auditory and visual patterns present in human neural representations.
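The abstract does not specify the similarity measure used to compare EEG responses with model embeddings, so the following is only a minimal illustrative sketch of one common approach, a representational similarity analysis (RSA): build a representational dissimilarity matrix (RDM) per stimulus set for the EEG features and for each hidden layer's embeddings, then correlate the RDMs layer by layer. All array shapes, function names, and the choice of correlation distance with Spearman correlation are assumptions for illustration, not the authors' method.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr


def rdm(features):
    """Condensed representational dissimilarity matrix from a
    (n_stimuli, n_features) array, using correlation distance."""
    return pdist(features, metric="correlation")


def layerwise_similarity(eeg_features, layer_embeddings):
    """Spearman correlation between the EEG RDM and each layer's RDM.

    eeg_features: (n_stimuli, n_eeg_features) array, e.g. epoch-averaged
        EEG responses per audio-visual stimulus (hypothetical shape).
    layer_embeddings: list of (n_stimuli, n_dims) arrays, one per hidden
        layer (e.g. time-averaged transformer hidden states, assumed
        to be pre-extracted from the model).
    """
    eeg_rdm = rdm(eeg_features)
    scores = []
    for layer_feat in layer_embeddings:
        rho, _ = spearmanr(eeg_rdm, rdm(layer_feat))
        scores.append(rho)
    return np.array(scores)


if __name__ == "__main__":
    # Random placeholder data: 40 stimuli, 64 EEG channels, 12 layers.
    rng = np.random.default_rng(0)
    eeg = rng.standard_normal((40, 64))
    layers = [rng.standard_normal((40, 768)) for _ in range(12)]
    print(layerwise_similarity(eeg, layers))
```

The layer with the highest correlation would indicate where in the network's hierarchy the representation most resembles the neural encoding; under the abstract's finding, the lower layers would score highest for auditory and visual features.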
Keywords: EEG, AVSR, information representation