Keynote & Tutorial

Wednesday, August 7, 2:40 - 3:20 pm, Kresge Hall (Keynote)
Wednesday, August 7, 4:30 - 6:15 pm, Kresge Hall (Tutorial)

Recent advances in interpretability of deep neural network models

Jack Lindsey
Matteo Alleman
Mitchell Ostrow
Minni Sun
Ankit Vishnubhotla

Jack Lindsey1, Matteo Alleman2, Mitchell Ostrow3, Minni Sun2 and Ankit Vishnubhotla4, 1Anthropic, 2Columbia University, 3MIT, 4University of Chicago.

Understanding the mechanisms of computation inside neural networks is important for both neuroscientists and machine learning researchers. The past few years have seen rapid conceptual and technical advances in the field of “mechanistic interpretability” of deep neural networks used in machine learning, such as large language models. This talk will cover important findings in this field, focusing on the application of sparse autoencoders (SAEs) to decompose neural network activations into more interpretable components. We will begin by introducing the phenomenon of “superposition,” in which networks can represent and compute with many more semantically meaningful “features” than they have neurons. We will then introduce SAEs as a means of extracting features from superposition. Most of the talk will focus on empirical results from applying SAEs to state-of-the-art large language models, which uncover a remarkably rich set of abstract concepts represented linearly in model activations. Finally, we will discuss causal interventions and other techniques that build on SAEs to construct a “circuit”-level understanding of model computation.
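
To make the decomposition concrete, the sketch below shows a minimal sparse autoencoder in PyTorch: activations of dimension d_model are encoded into a larger set of non-negative feature activations, which are then recombined linearly to reconstruct the original activation, with an L1 penalty encouraging sparsity. The class name, initialization scale, and L1 coefficient are illustrative assumptions, not details from the talk.

# Minimal sparse autoencoder (SAE) sketch: decompose activations of
# dimension d_model into d_features sparsely active features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_features: int):
        super().__init__()
        self.W_enc = nn.Parameter(torch.randn(d_model, d_features) * 0.01)
        self.W_dec = nn.Parameter(torch.randn(d_features, d_model) * 0.01)
        self.b_enc = nn.Parameter(torch.zeros(d_features))
        self.b_dec = nn.Parameter(torch.zeros(d_model))

    def forward(self, x):
        # Encode: feature activations are non-negative and (ideally) sparse.
        f = F.relu((x - self.b_dec) @ self.W_enc + self.b_enc)
        # Decode: reconstruct the activation as a linear combination of features.
        x_hat = f @ self.W_dec + self.b_dec
        return x_hat, f

def sae_loss(x, x_hat, f, l1_coeff=1e-3):
    # Reconstruction error plus an L1 sparsity penalty on feature activations.
    return F.mse_loss(x_hat, x) + l1_coeff * f.abs().sum(dim=-1).mean()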

The goal of the tutorial is to give participants preliminary hands-on experience with sparse autoencoders, a popular tool for mechanistic interpretability of large language models. Sparse autoencoders (SAEs) decompose model representations into a linear combination of sparsely active “features,” which often correspond to semantically meaningful concepts. In this tutorial, participants will first implement and train an SAE on a toy model with a known ground-truth latent structure underlying its representations. This section will introduce some of the key implementation considerations and challenges involved in training SAEs. Participants will then perform analyses with SAEs trained on small language models, to gain familiarity with strategies for evaluating and visualizing SAE outputs. Finally, participants will spend some time experimenting with published interactive tools such as Neuronpedia for exploring SAE features in large language models, to gain intuition about the kind of information these tools provide about model representations at scale. The tutorial will be conducted in a Colab notebook written in Python, using the TransformerLens library to work with language model activations.
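
As a rough preview of this pipeline, the sketch below collects residual-stream activations with TransformerLens and fits an SAE on them, reusing the illustrative SparseAutoencoder and sae_loss defined above. The model name (“gpt2”), hook point, expansion factor, and training loop are placeholders for exposition, not the tutorial’s actual notebook settings.

# Sketch: gather language-model activations with TransformerLens and train an SAE on them.
import torch
from transformer_lens import HookedTransformer

model = HookedTransformer.from_pretrained("gpt2")   # any small model supported by TransformerLens
hook_name = "blocks.0.hook_resid_post"              # residual stream after the first block

tokens = model.to_tokens(["The quick brown fox jumps over the lazy dog."])
_, cache = model.run_with_cache(tokens)
acts = cache[hook_name].detach().reshape(-1, model.cfg.d_model)  # (batch * seq, d_model)

sae = SparseAutoencoder(d_model=model.cfg.d_model, d_features=8 * model.cfg.d_model)
opt = torch.optim.Adam(sae.parameters(), lr=1e-4)

for step in range(100):  # in practice: many more steps over a large buffer of activations
    x_hat, f = sae(acts)
    loss = sae_loss(acts, x_hat, f)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Inspect which features fire most strongly on each token position as a starting
# point for looking for interpretable concepts.
print(f.topk(5, dim=-1).indices[:5])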