Jacobian Sparse Autoencoders, ICML 2025
Lucy Farnik, Tim Lawson, Conor Houghton, Laurence Aitchison
Sparse autoencoders (SAEs) have been widely applied to interpret the activations of neural networks, but we are primarily interested in the computations those networks perform. We propose Jacobian SAEs (JSAEs), which incentivize sparsity not only in the input and output activations of a model component, but also in the computation that connects them. We formalize this in terms of the Jacobian between the input and output SAE latents.
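To make the Jacobian objective concrete, here is a minimal sketch (not the authors' implementation) of how one might compute it: two hypothetical SAEs are placed around an MLP, the map from input latents to output latents is composed from decode, MLP, and encode steps, and its Jacobian is penalized with an L1 term. All function and variable names here are illustrative assumptions.

```python
import jax
import jax.numpy as jnp


def relu(x):
    return jnp.maximum(x, 0.0)


def make_sae(key, d_model, d_latent):
    # Hypothetical tied-free SAE: a ReLU encoder and a linear decoder.
    k_enc, k_dec = jax.random.split(key)
    W_enc = jax.random.normal(k_enc, (d_model, d_latent)) * 0.02
    W_dec = jax.random.normal(k_dec, (d_latent, d_model)) * 0.02
    return W_enc, W_dec


def encode(sae, x):
    W_enc, _ = sae
    return relu(x @ W_enc)


def decode(sae, z):
    _, W_dec = sae
    return z @ W_dec


key = jax.random.PRNGKey(0)
k_in, k_out, k_w1, k_w2, k_x = jax.random.split(key, 5)
d_model, d_latent, d_hidden = 16, 64, 32

sae_in = make_sae(k_in, d_model, d_latent)    # SAE on the component's input
sae_out = make_sae(k_out, d_model, d_latent)  # SAE on the component's output
W1 = jax.random.normal(k_w1, (d_model, d_hidden)) * 0.02
W2 = jax.random.normal(k_w2, (d_hidden, d_model)) * 0.02


def mlp(x):
    # Stand-in for the model component whose computation we want sparse.
    return relu(x @ W1) @ W2


def latents_to_latents(z_in):
    # Decode input latents -> run the component -> encode into output latents.
    return encode(sae_out, mlp(decode(sae_in, z_in)))


x = jax.random.normal(k_x, (d_model,))
z_in = encode(sae_in, x)

# Jacobian of output latents w.r.t. input latents: shape (d_latent, d_latent).
J = jax.jacrev(latents_to_latents)(z_in)

# An L1 penalty on the Jacobian (one plausible sparsity term) that could be
# added to the usual SAE reconstruction and activation-sparsity losses.
jacobian_penalty = jnp.abs(J).sum()
```

Under this framing, a sparse Jacobian means each output latent depends on only a few input latents, so the component's computation, not just its activations, is forced toward interpretable structure.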