
Poster A113 in Poster Session A - Tuesday, August 6, 2024, 4:15 – 6:15 pm, Johnson Ice Rink

Evaluating and supervising vision models with multi-level similarity judgments

Lukas Muttenthaler1,2,3, Frieda Born1,2,4, Klaus Greff3, Thomas Unterthiner3, Andrew Lampinen3, Klaus-Robert Müller1,2,3,5,6, Mike Mozer3; 1Machine Learning Group, Technical University of Berlin, Germany, 2BIFOLD, Berlin Institute for the Foundations of Learning and Data, Berlin, Germany, 3Google DeepMind, 4Max Planck Institute for Human Development, Berlin, Germany, 5Department of Artificial Intelligence, Korea University, 6Max Planck Institute for Informatics, Saarbrücken, Germany

Vision foundation models are becoming increasingly pervasive. Despite their incredible success, it remains unclear to what degree they see the world the way humans do. A growing body of recent work investigates the alignment between human and model representations but has not systematically characterized this alignment across levels of conceptual abstraction. Here, we attempt to bridge this gap and collect a large human similarity judgment dataset of triplet odd-one-out choices on three levels of semantic abstraction: coarse-grained, fine-grained, and class-boundary. This multi-level behavioral dataset enables more nuanced comparisons between humans and computer vision models than has previously been possible. Models and people are best aligned on class-boundary and worst aligned on coarse-grained similarity judgments. Human alignment with various model types depends on the level of abstraction: image/text models match people best for superordinate categories, but self-supervised image models match best for fine-grained semantic categories. Our dataset facilitates the evaluation, and potentially the improvement, of vision foundation models.
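
For readers unfamiliar with the task: below is a minimal sketch of how odd-one-out agreement between a model and human choices might be scored from model embeddings. It is not the authors' implementation; the names odd_one_out, alignment, model_embeds, and human_choices are illustrative, and the toy data is random.

# Sketch: scoring triplet odd-one-out alignment from embeddings.
# Assumption: the model's odd one out is the item excluded from the
# most similar pair under cosine similarity.
import numpy as np

def odd_one_out(embeddings: np.ndarray) -> int:
    """Return the index (0, 1, or 2) of the odd one out for a 3 x d
    embedding matrix."""
    # Normalize rows so dot products are cosine similarities.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ e.T
    pairs = [(0, 1), (0, 2), (1, 2)]
    # The most similar pair groups together; the remaining item is odd.
    i, j = max(pairs, key=lambda p: sims[p[0], p[1]])
    return ({0, 1, 2} - {i, j}).pop()

def alignment(model_embeds, human_choices) -> float:
    """Fraction of triplets on which the model picks the same odd one
    out as the human participant."""
    hits = [odd_one_out(e) == h for e, h in zip(model_embeds, human_choices)]
    return float(np.mean(hits))

# Toy usage with random placeholder embeddings:
rng = np.random.default_rng(0)
triplets = [rng.normal(size=(3, 8)) for _ in range(5)]
human = [0, 2, 1, 0, 2]  # hypothetical human odd-one-out choices
print(alignment(triplets, human))

Under this scheme, alignment at each level of abstraction (coarse-grained, fine-grained, class-boundary) would be computed over the triplets collected at that level.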

Keywords: representation learning; human behavior; similarity spaces; object concepts
