My Blog
Sensitivity-Based Feature Discovery Without Sparsity Assumptions
Discovering monosemantic features in language models is a key challenge in mechanistic interpretability. Existing methods such as sparse autoencoders (SAEs) rely on two main assumptions: (1) sparsity and (2) the linear representation hypothesis. We propose an alternative method for discovering features based on a causal sensitivity score, inspired by the notion of sensitivity in analysis. We introduce two key principles---(1) the _sensitivity hypothesis_ and (2) the _relative norm hypothesis_, a principled alternative to the linear representation hypothesis---and show how they naturally lead to a method for discovering features in language models. This formulation implicitly accounts for the effects of layer normalization in modern architectures, while explaining feature sparsity _without requiring it as an assumption_. We validate our method on the Pythia-70M model, finding that sensitivity-based features are slightly more interpretable than SAE features without requiring sparsity assumptions. We also find that SAEs already discover "sensitive" features. However, our approach currently faces limitations: features must be discovered sequentially rather than in parallel, and the method shows an intriguing tendency to find "feature removal" directions, which we address but do not fully understand.

Ramsey Theory is Fun: a Surprising Fact when…
A quick and surprising fact from Ramsey theory. For all symmetric functions…
Folding for Data Availability; Fun for All Sizes
In this post, we will explore a new technique for generating data availability proofs, primarily leveraging cryptographic folding and the BLAKE3 hash function.

Theorem Proving's Potential
Embedding spaces and AI, learning, and unifying programming and proving: why I'm excited about theorem proving.
