My Blog
Sensitivity-Based Feature Discovery Without Sparsity Assumptions
Discovering monosemantic features in language models is a key challenge in mechanistic interpretability. Existing methods such as sparse autoencoders (SAEs) rely on two main assumptions: (1) sparsity and (2) the linear representation hypothesis. We propose an alternative method for discovering features based on a causal sensitivity score, inspired by the notion of sensitivity in analysis. We introduce two key principles---(1) the _sensitivity hypothesis_ and (2) the _relative norm hypothesis_, a principled alternative to the linear representation hypothesis---and show how they naturally lead to a method for discovering features in language models. This formulation implicitly accounts for the effects of layer normalization in modern architectures, while explaining feature sparsity _without requiring it as an assumption_. We validate our method on the Pythia-70M model, finding that sensitivity-based features are slightly more interpretable than SAE features without requiring sparsity assumptions. We also find that SAEs already discover "sensitive" features. However, our approach currently faces limitations: features must be discovered sequentially rather than in parallel, and the method shows an intriguing tendency to find "feature removal" directions, which we address but do not fully understand.

Ramsey Theory is Fun: a Surprising Fact when…
A quick and surprising fact from Ramsey theory. For all symmetric functions…
Folding for Data Availability; Fun for All Sizes
In this post, we will explore a new technique for generating data availability proofs, primarily leveraging cryptographic folding and the BLAKE3 hash function.

Theorem Proving's Potential
Embedding spaces and AI, learning, and unifying programming and proving: why I'm excited about theorem proving.
