Here you can find a collection of notebooks documenting my work on mechanistic interpretability for machine learning. Most of this work currently consists of replicating existing research, which I'm doing as a way to acquaint myself with the field.
I try my best to make the notebooks somewhat pedagogical and accessible to anyone who has at least a basic grasp of machine learning but is new to the field of mechanistic interpretability. I do this for a few reasons: first, because I fit that description pretty well myself, so forcing myself to explain my work to not-so-distant-past-me helps solidify what I'm learning.
Second, because one of the goals of mechanistic interpretability is to encourage machine learning practitioners to open up their black boxes and develop a deeper understanding of how models work on the inside. If my mech-interp work were only accessible to people already familiar with the field, it would defeat its own purpose.
Finally, I think it's simply good practice to spend time documenting and sharing my work: it makes past work much easier to revisit, creates a helpful accountability mechanism, and working in the open encourages a whole set of good habits.