
Poster A113 in Poster Session A - Tuesday, August 6, 2024, 4:15 – 6:15 pm, Johnson Ice Rink

Evaluating and supervising vision models with multi-level similarity judgments

Lukas Muttenthaler1,2,3, Frieda Born1,2,4, Klaus Greff3, Thomas Unterthiner3, Andrew Lampinen3, Klaus-Robert Müller1,2,3,5,6, Mike Mozer3; 1Machine Learning Group, Technical University of Berlin, Germany, 2BIFOLD, Berlin Institute for the Foundations of Learning and Data, Berlin, Germany, 3Google DeepMind, 4Max Planck Institute for Human Development, Berlin, Germany, 5Department of Artificial Intelligence, Korea University, 6Max Planck Institute for Informatics, Saarbrücken, Germany

Vision foundation models are becoming increasingly pervasive. Despite their incredible success, it remains unclear to what degree they see the world the way humans do. A growing body of recent work investigates the alignment between human and model representations but has not systematically characterized this alignment across levels of conceptual abstraction. Here, we attempt to bridge this gap and collect a large human similarity judgment dataset of triplet odd-one-out choices on three levels of semantic abstraction: coarse-grained, fine-grained, and class-boundary. This multi-level behavioral dataset enables more nuanced comparisons between humans and computer vision models than has previously been possible. Models and people are best aligned on class-boundary and worst aligned on coarse-grained similarity judgments. Human alignment with various model types depends on the level of abstraction: image/text models match people best for superordinate categories, but self-supervised image models match best for fine-grained semantic categories. Our dataset facilitates the evaluation, and potentially the improvement, of vision foundation models.
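
For readers unfamiliar with the task: below is a minimal sketch of how odd-one-out agreement between a model and human choices might be scored from model embeddings. It is not the authors' implementation; the names odd_one_out, alignment, model_embeds, and human_choices are illustrative, and the toy data is random.

# Sketch: scoring triplet odd-one-out alignment from embeddings.
# Assumption: the model's odd one out is the item excluded from the
# most similar pair under cosine similarity.
import numpy as np

def odd_one_out(embeddings: np.ndarray) -> int:
    """Return the index (0, 1, or 2) of the odd one out for a 3 x d
    embedding matrix."""
    # Normalize rows so dot products are cosine similarities.
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = e @ e.T
    pairs = [(0, 1), (0, 2), (1, 2)]
    # The most similar pair groups together; the remaining item is odd.
    i, j = max(pairs, key=lambda p: sims[p[0], p[1]])
    return ({0, 1, 2} - {i, j}).pop()

def alignment(model_embeds, human_choices) -> float:
    """Fraction of triplets on which the model picks the same odd one
    out as the human participant."""
    hits = [odd_one_out(e) == h for e, h in zip(model_embeds, human_choices)]
    return float(np.mean(hits))

# Toy usage with random placeholder embeddings:
rng = np.random.default_rng(0)
triplets = [rng.normal(size=(3, 8)) for _ in range(5)]
human = [0, 2, 1, 0, 2]  # hypothetical human odd-one-out choices
print(alignment(triplets, human))

Under this scheme, alignment at each level of abstraction (coarse-grained, fine-grained, class-boundary) would be computed over the triplets collected at that level.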

Keywords: representation learning; human behavior; similarity spaces; object concepts
