In the summer of 2024, I joined a SPAR project led by Stefan Heimersheim at Apollo Research, studying how sparse autoencoder (SAE) features evolve across layers in large language models. SAEs decompose model activations into interpretable features, but relatively little was known about how those features relate to each other as information flows through the network.
The core challenge was scale: with roughly 300k SAE features and billions of potential edges to analyze, the work required a batched pipeline that could accumulate statistics across 10 million tokens without ever materializing the full activation space. I implemented two similarity measures at that scale, Jaccard similarity over feature co-activations and cosine similarity between decoder directions, and built an ablation pipeline to test whether correlated feature pairs reflected genuine causal relationships rather than spurious co-occurrence. The results showed that highly correlated pairs often, but not always, corresponded to true causal effects.
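The batched accumulation idea can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: all function names, array shapes, and the activation threshold are assumptions. The key trick is that Jaccard similarity only needs three running counts per feature pair, so each batch contributes small integer matrices that sum across batches, while decoder cosine similarity needs no activations at all, only the SAE decoder weights.

```python
import numpy as np

def accumulate_counts(acts_a, acts_b, threshold=0.0):
    """Co-activation statistics for one batch of tokens.
    acts_a: [T, Fa] and acts_b: [T, Fb] SAE feature activations.
    Returns counts that can be summed across batches:
    pairwise co-active counts plus per-feature active counts."""
    on_a = (acts_a > threshold).astype(np.int64)  # binarize: feature "fires"
    on_b = (acts_b > threshold).astype(np.int64)
    both = on_a.T @ on_b                          # [Fa, Fb] co-activation counts
    return both, on_a.sum(axis=0), on_b.sum(axis=0)

def jaccard(both, count_a, count_b):
    """Jaccard = |A and B| / |A or B|, from accumulated counts."""
    union = count_a[:, None] + count_b[None, :] - both
    return np.where(union > 0, both / np.maximum(union, 1), 0.0)

def decoder_cosine(W_dec_a, W_dec_b):
    """Cosine similarity between decoder directions of two SAEs.
    W_dec_a: [Fa, d_model], W_dec_b: [Fb, d_model]."""
    na = W_dec_a / np.linalg.norm(W_dec_a, axis=1, keepdims=True)
    nb = W_dec_b / np.linalg.norm(W_dec_b, axis=1, keepdims=True)
    return na @ nb.T
```

In a real run the `[Fa, Fb]` count matrix for 300k features would itself be too large to hold densely, so the accumulation would be tiled over feature blocks; the per-batch logic stays the same.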
To make the resulting graphs explorable, I built a web-based feature browser inspired by Anthropic's interpretability visualizations, which lets you navigate feature relationships across layers interactively.
The paper was accepted at the NeurIPS ATTRIB workshop, which felt like a meaningful milestone given that I had only recently pivoted into AI safety research full-time.