Learning to Skip the Middle Layers of Transformers
Tim Lawson, Laurence Aitchison
Interpretability research has shown that the middle layers of Transformers exhibit greater redundancy and that early layers aggregate information into token positions. Motivated by these observations, we propose a novel gated architecture that dynamically skips a variable number of layers from the middle outward. We control residual norms with a ‘sandwich’ or ‘peri-layernorm’ scheme and gate sparsity with an adaptive regularization loss. Unfortunately, at the scales investigated, our approach does not improve the trade-off between cross-entropy and estimated FLOPs compared to dense baselines with fewer layers.
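To make the middle-outward skipping concrete, here is a minimal sketch in PyTorch, written to illustrate the idea rather than reproduce the paper's method. The Block class, the per-token linear gate, and the hard per-layer thresholds are illustrative assumptions; the gated attention over skipped positions, the ‘sandwich’/‘peri-layernorm’ scheme, and the adaptive sparsity loss are not reproduced here.

```python
# Illustrative sketch of middle-outward layer skipping (not the paper's code).
import torch
import torch.nn as nn


class Block(nn.Module):
    """Minimal pre-norm feed-forward block (attention omitted for brevity)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mlp(self.norm(x))


class SkipMiddle(nn.Module):
    """Skips a symmetric span of central blocks based on a per-token gate."""

    def __init__(self, n_layers: int, d_model: int):
        super().__init__()
        assert n_layers % 2 == 0, "assume an even layer count for a symmetric span"
        self.blocks = nn.ModuleList([Block(d_model) for _ in range(n_layers)])
        self.gate = nn.Linear(d_model, 1)  # per-token gate from the residual stream
        self.mid = n_layers // 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); gate values in [0, 1]
        g = torch.sigmoid(self.gate(x)).squeeze(-1)  # (batch, seq)
        for i, block in enumerate(self.blocks):
            # Distance of layer i from the centre of the stack.
            dist = abs(i - self.mid + 0.5)  # 0.5, 1.5, ... for even n_layers
            # Inner layers need a higher gate value to run, so as g falls the
            # skipped span grows from the middle outward.
            threshold = 1.0 - dist / self.mid
            keep = (g >= threshold).to(x.dtype).unsqueeze(-1)  # (batch, seq, 1)
            # Hard mask for illustration only; training would need a
            # differentiable relaxation (e.g. a straight-through estimator).
            x = x + keep * block(x)
        return x


# Example: a 12-layer stack where tokens with low gate values bypass the
# central layers while the outermost layers still execute.
model = SkipMiddle(n_layers=12, d_model=64)
out = model(torch.randn(2, 16, 64))
```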