
Poster A135 in Poster Session A - Tuesday, August 6, 2024, 4:15 – 6:15 pm, Johnson Ice Rink

Why audio-visual learning improves voice identity recognition: a neurocomputational model

Christian Gumbsch1,2, Martin V. Butz2, Katharina von Kriegstein1; 1Chair of Cognitive and Clinical Neuroscience, TU Dresden, Germany, 2Neuro-Cognitive Modeling, University of Tübingen, Germany

Voice identity recognition in auditory-only conditions is facilitated by knowing the face of the speaker. This effect is called the ‘face-benefit’. Based on neuroscience findings, we hypothesized that this benefit emerges from two factors: First, a generative world model integrates information from multiple senses to better predict sensory dynamics. Second, the model substitutes absent sensory information, e.g., facial dynamics, with internal simulations. We developed a deep generative model that learns to simulate such multisensory dynamics, acquiring latent speaker-characteristic contexts in the process. We trained the model on synthetic audio-visual data of talking faces and tested its ability to recognize speakers from their voice alone. We found that the model recognizes previously seen speakers better than previously unseen speakers. These modeling results confirm that multisensory simulations and predictive substitution of missing visual inputs give rise to the face-benefit.
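The abstract does not specify the architecture, so the following is only a minimal illustrative sketch of the two hypothesized factors: a recurrent world model that fuses audio and visual inputs with a per-speaker latent context, and predictive substitution of the missing visual stream at test time. All module names, dimensions, and the recognition-by-prediction-error procedure below are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class MultisensoryWorldModel(nn.Module):
    """Hypothetical sketch: a generative world model that predicts
    audio-visual dynamics from a recurrent state plus a learned
    speaker-characteristic context vector."""

    def __init__(self, audio_dim=64, visual_dim=64, hidden_dim=256, context_dim=16):
        super().__init__()
        self.audio_enc = nn.Linear(audio_dim, hidden_dim)
        self.visual_enc = nn.Linear(visual_dim, hidden_dim)
        # Recurrent core integrates both modalities and the speaker context.
        self.rnn = nn.GRUCell(2 * hidden_dim + context_dim, hidden_dim)
        self.audio_dec = nn.Linear(hidden_dim, audio_dim)
        self.visual_dec = nn.Linear(hidden_dim, visual_dim)

    def step(self, audio_t, visual_t, context, h):
        """One prediction step. If visual_t is None (voice-only testing),
        the model substitutes its own visual prediction, i.e., an
        internal simulation of the absent facial dynamics."""
        if visual_t is None:
            visual_t = self.visual_dec(h)  # predictive substitution
        x = torch.cat([self.audio_enc(audio_t),
                       self.visual_enc(visual_t),
                       context], dim=-1)
        h = self.rnn(x, h)
        return self.audio_dec(h), self.visual_dec(h), h

def recognize_speaker(model, audio_seq, contexts, hidden_dim=256):
    """Assumed recognition scheme: pick the speaker context whose
    predictions best fit the voice-only sequence (lowest next-step
    audio prediction error)."""
    errors = []
    with torch.no_grad():
        for c in contexts:  # one learned context vector per known speaker
            h = torch.zeros(1, hidden_dim)
            err = 0.0
            for t in range(len(audio_seq) - 1):
                pred_audio, _, h = model.step(audio_seq[t], None, c, h)
                err += torch.mean((pred_audio - audio_seq[t + 1]) ** 2).item()
            errors.append(err)
    return int(torch.tensor(errors).argmin())

# Toy usage with random data: three hypothetical speaker contexts,
# one short voice-only clip of ten audio frames.
model = MultisensoryWorldModel()
contexts = [torch.randn(1, 16) for _ in range(3)]
clip = [torch.randn(1, 64) for _ in range(10)]
print(recognize_speaker(model, clip, contexts))
```

Under this reading, the face-benefit would arise because contexts learned from audio-visual exposure let the model simulate the missing visual stream and thereby predict the voice better for previously seen speakers than for unseen ones.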

Keywords: multisensory learning; speech; voice; world models
