Tim Lawson

I am a PhD student funded by the UKRI CDT in Interactive AI at the University of Bristol, working on language modelling and interpretability. I have a Physics MSci from the University of Cambridge and seven years of experience in software engineering for data analytics. For my CV, please see LinkedIn; for recent work, see GitHub and HuggingFace. I occasionally write about my work (see posts).

Research

  • Learning to Skip the Middle Layers of Transformers

    Tim Lawson, Laurence Aitchison

    arXiv, GitHub

    Interpretability research has shown that the middle layers of Transformers exhibit greater redundancy, and that early layers aggregate information into token positions. Motivated by this, we propose a novel gated architecture that dynamically skips a variable number of layers from the middle outward. We control residual norms with a ‘sandwich’ or ‘peri-layernorm’ scheme and gate sparsity with an adaptive regularization loss. Unfortunately, at the scales we investigated, our approach does not improve the trade-off between cross-entropy and estimated FLOPs compared to dense baselines with fewer layers.
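
    As a rough PyTorch sketch of the gating idea (not the implementation in the linked repository: the block, gate, and depth names are illustrative, and the paper’s gated attention and norm schemes are omitted), per-token gates can be nested so that the blocks closest to the center are suppressed first:

      import torch
      import torch.nn as nn

      class Block(nn.Module):
          """Stand-in for a Transformer block; returns a residual update."""
          def __init__(self, d_model: int):
              super().__init__()
              self.norm = nn.LayerNorm(d_model)
              self.mlp = nn.Sequential(
                  nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
              )

          def forward(self, x: torch.Tensor) -> torch.Tensor:
              return self.mlp(self.norm(x))

      class MiddleSkip(nn.Module):
          """Gate the central blocks so that inner blocks are skipped before outer ones."""
          def __init__(self, d_model: int, n_layers: int, depth: int):
              super().__init__()
              self.blocks = nn.ModuleList(Block(d_model) for _ in range(n_layers))
              self.depth = depth                       # how many distances from the center are gated
              self.center = n_layers // 2
              self.gates = nn.Linear(d_model, depth)   # one per-token gate per distance

          def forward(self, x: torch.Tensor):
              g = torch.sigmoid(self.gates(x))         # (batch, seq, depth), values in [0, 1]
              # Nested products: the innermost distance multiplies every gate, so it is
              # the first to be suppressed as gate values move towards zero.
              keep = torch.cumprod(g.flip(-1), dim=-1).flip(-1)
              for i, block in enumerate(self.blocks):
                  dist = abs(i - self.center)
                  update = block(x)
                  if dist < self.depth:
                      update = keep[..., dist:dist + 1] * update   # soft-skip this block
                  x = x + update
              return x, g                              # g can feed a sparsity regularization loss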

  • Jacobian Sparse Autoencoders, ICML 2025

    Lucy Farnik, Tim Lawson, Conor Houghton, Laurence Aitchison

    arXiv, GitHub

    Sparse autoencoders (SAEs) have been widely applied to interpret the activations of neural networks, even though we are primarily interested in the computations that networks perform. We propose Jacobian SAEs (JSAEs), which incentivize sparsity not only of the input and output activations of a model component (e.g., an MLP sublayer), but also of the computation that connects them. We formalize this relationship in terms of the Jacobian between the input and output SAE latents.
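
    As a rough illustration of the quantity being penalized (toy stand-ins for the MLP and SAEs; the paper computes the Jacobian far more efficiently than this brute-force autograd call):

      import torch
      import torch.nn as nn
      from torch.autograd.functional import jacobian

      d_model, n_latents = 64, 256

      # Toy stand-ins: the MLP sublayer being interpreted, plus input and output SAEs.
      mlp = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model))
      enc_in, dec_in = nn.Linear(d_model, n_latents), nn.Linear(n_latents, d_model)
      enc_out = nn.Linear(d_model, n_latents)

      def latents_out_from_in(z_in: torch.Tensor) -> torch.Tensor:
          # Input SAE latents -> reconstructed input -> MLP output -> output SAE latents.
          return torch.relu(enc_out(mlp(dec_in(z_in))))

      x = torch.randn(d_model)         # one token's pre-MLP activation
      z_in = torch.relu(enc_in(x))     # input SAE latents

      # Jacobian of output latents w.r.t. input latents: shape (n_latents, n_latents).
      J = jacobian(latents_out_from_in, z_in, create_graph=True)

      # Penalizing its magnitude encourages a sparse latent-to-latent computation,
      # alongside the usual SAE reconstruction and sparsity losses.
      jacobian_penalty = J.abs().mean()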

  • Sparse Autoencoders Can Interpret Randomly Initialized Transformers

    Thomas Heap, Tim Lawson, Lucy Farnik, Laurence Aitchison

    arXiv, GitHub

    We train SAEs on the activations of Transformers whose parameters have been randomized (e.g., sampled i.i.d. from a Gaussian), and find that the auto-interpretability scores of a sample of SAE latents are similar to the scores for unmodified Transformers. That is, SAE latents can appear interpretable through the lens of maximally activating examples even when the underlying model is randomized, because the input data (token embeddings) are inherently sparse and produce single-token activation patterns.
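
    Randomizing the model is the simple part; a minimal sketch assuming the Hugging Face transformers API (the model name and Gaussian scale are illustrative, not the paper’s exact setup):

      import torch
      from transformers import AutoModelForCausalLM, AutoTokenizer

      model_name = "EleutherAI/pythia-70m"        # any small causal LM works here
      tokenizer = AutoTokenizer.from_pretrained(model_name)
      model = AutoModelForCausalLM.from_pretrained(model_name)

      # Replace every parameter with i.i.d. Gaussian noise, so the model keeps its
      # architecture but none of its learned structure.
      with torch.no_grad():
          for p in model.parameters():
              p.normal_(mean=0.0, std=0.02)

      # Collect residual-stream activations to train SAEs on, exactly as for the
      # unmodified model.
      inputs = tokenizer("The quick brown fox", return_tensors="pt")
      hidden_states = model(**inputs, output_hidden_states=True).hidden_states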

  • Residual Stream Analysis with Multi-Layer SAEs, ICLR 2025

    Tim Lawson, Lucy Farnik, Conor Houghton, Laurence Aitchison

    arXiv, GitHub, Poster

    We train a single sparse autoencoder (SAE) on the residual activation vectors from every layer of Transformer language models, and quantify how SAE latents are activated by inputs from multiple layers. Anthropic’s interpretability team have since described this approach as a ‘shared SAE’, and we compare it to their ‘crosscoders’ in the discussion section. Interestingly, we find that individual latents are often active at a single layer for a given token, but this layer may differ between tokens. We also try to reduce ‘feature drift’ by applying tuned-lens transformations to each layer.
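
    A minimal sketch of the training setup (using a plain ReLU/L1 SAE for brevity, rather than the variant used in the paper): the residual vector at every layer is treated as a training example for the same autoencoder.

      import torch
      import torch.nn as nn

      class SAE(nn.Module):
          """A plain sparse autoencoder with a ReLU latent layer."""
          def __init__(self, d_model: int, n_latents: int):
              super().__init__()
              self.enc = nn.Linear(d_model, n_latents)
              self.dec = nn.Linear(n_latents, d_model)

          def forward(self, x: torch.Tensor):
              z = torch.relu(self.enc(x))
              return self.dec(z), z

      def mlsae_batch(hidden_states: tuple[torch.Tensor, ...]) -> torch.Tensor:
          # hidden_states holds one (batch, seq, d_model) tensor per layer, e.g. from a
          # Hugging Face model called with output_hidden_states=True. Stacking the layers
          # into the batch dimension gives one training set for a single shared SAE.
          layers = torch.stack(hidden_states, dim=0)       # (n_layers, batch, seq, d_model)
          return layers.reshape(-1, layers.shape[-1])      # (n_layers * batch * seq, d_model)

      sae = SAE(d_model=512, n_latents=4096)
      hidden_states = tuple(torch.randn(2, 8, 512) for _ in range(6))   # stand-in activations
      x = mlsae_batch(hidden_states)
      recon, z = sae(x)
      loss = (recon - x).pow(2).mean() + 1e-3 * z.abs().mean()          # reconstruction + L1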