-
Sparse Autoencoders Can Interpret Randomly Initialized Transformers
Thomas Heap, Tim Lawson, Lucy Farnik, Laurence Aitchison
We train SAEs on the activations of transformers whose parameters are randomized (e.g., sampled i.i.d. from a Gaussian), and find that the auto-interpretability scores of a sample of SAE latents are similar to those of SAEs trained on the unmodified transformers. That is, SAE latents may appear interpretable through the lens of maximally activating examples even when the underlying model is randomized, because the input data (token embeddings) is inherently sparse and produces single-token activation patterns.
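As a rough illustration of this setup (not the authors' code), the sketch below collects activations from a toy stand-in for a transformer layer whose weights are sampled i.i.d. from a Gaussian, encodes them with a standard ReLU SAE, and ranks each latent's maximally activating tokens; all names and dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_sae, n_tokens = 16, 64, 128

# Toy stand-in for a transformer layer: its weight matrix is sampled
# i.i.d. from a Gaussian, mirroring the paper's randomization.
W_rand = rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
token_embeds = rng.normal(size=(n_tokens, d_model))
acts = np.maximum(token_embeds @ W_rand, 0.0)  # activations of the randomized model

# Standard ReLU SAE: latents = ReLU(x @ W_enc + b_enc), recon = latents @ W_dec + b_dec.
W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
b_enc = np.zeros(d_sae)
W_dec = rng.normal(size=(d_sae, d_model)) / np.sqrt(d_sae)
b_dec = np.zeros(d_model)

latents = np.maximum(acts @ W_enc + b_enc, 0.0)
recon = latents @ W_dec + b_dec

# A latent's "maximally activating examples" are the tokens where it fires hardest;
# auto-interpretability scores are computed from such examples.
top_examples = np.argsort(latents, axis=0)[::-1][:5]  # (5, d_sae) token indices
print(latents.shape, recon.shape, top_examples.shape)
```

In a real run the SAE weights would be fit by minimizing reconstruction error with a sparsity penalty; the point here is only the data flow from randomized model to maximally activating examples.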
-
Residual Stream Analysis with Multi-Layer SAEs
We train a single sparse autoencoder (SAE) on the residual stream activation vectors from every layer of transformer language models, and quantify how SAE latents are activated by inputs from multiple layers. Interestingly, we find that individual latents are often active at a single layer for a given token, but this layer may differ between tokens.
Our approach builds on Yun et al. (2021), who learned a sparse dictionary over multiple layers of the residual stream of BERT by iterative optimization. Lindsey et al. (2024) have since described this approach as a ‘shared SAE’, and we compare it to their ‘crosscoders’ in the discussion. We also attempt to reduce ‘feature drift’ by applying tuned-lens transformations (Belrose et al., 2023) to the activations at each layer.
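A minimal sketch of the shared-SAE idea, under illustrative assumptions (random stand-in activations, an untrained encoder): activations from every layer are pooled into one training set for a single SAE, and one can then ask, per token, at which layer each latent fires hardest.

```python
import numpy as np

rng = np.random.default_rng(1)
n_layers, n_tokens, d_model, d_sae = 4, 32, 16, 64

# Residual-stream activations at every layer: shape (layer, token, d_model).
# In the paper these come from a language model; here they are random stand-ins.
resid = rng.normal(size=(n_layers, n_tokens, d_model))

# A single ("shared") SAE is trained on activations pooled across all layers,
# so the layer axis is flattened into the batch axis.
pooled = resid.reshape(n_layers * n_tokens, d_model)

W_enc = rng.normal(size=(d_model, d_sae)) / np.sqrt(d_model)
latents = np.maximum(pooled @ W_enc, 0.0).reshape(n_layers, n_tokens, d_sae)

# For each (token, latent) pair, the layer where the latent fires hardest.
# The paper's observation is that a latent is often active at a single layer
# for a given token, but that layer can differ between tokens.
peak_layer = latents.argmax(axis=0)  # shape (n_tokens, d_sae)
print(peak_layer.shape)
```

The tuned-lens variant would first map each layer's activations into a common basis before pooling, which is what the feature-drift mitigation above refers to.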