Here, I will follow Neel Nanda’s guide, How to Become a Mechanistic Interpretability Researcher.
Keep in mind
Do not just read things. Mech interp is a fundamentally empirical science.
Stage 1: Learning the ropes
I already have a decent understanding of linear algebra and all the mathematical tools needed.
- I’ll start by reading about Transformers: Chapter 12 of the Understanding Deep Learning textbook
- and Anthropic’s paper, A Mathematical Framework for Transformer Circuits.
- Code a simple Transformer like GPT-2 from scratch. Nanda suggests using ARENA Chapter 1.1.
- Refer to Ferrando et al.
- Code activation patching yourself
- Train linear probes
- Use SAEs (sparse autoencoders)
- Look at max activating dataset examples
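
As a warm-up for the activation patching item above, here is a minimal sketch of the idea on a toy network, written in pure Python so it runs without any libraries. Everything in it is an assumption made up for illustration: the weights, the two inputs, and the names `hidden`, `output`, and `run`. A real experiment would patch activations inside a Transformer (for example with PyTorch forward hooks), but the logic is the same: cache activations from a clean run, overwrite one of them during a corrupted run, and measure how much of the clean output is recovered.

```python
# Hypothetical toy model: 2 inputs -> 3 hidden units (ReLU) -> 1 output logit.
# All weights are invented for illustration only.
W1 = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]  # one row per hidden unit
W2 = [1.0, -2.0, -0.5]                       # readout weights


def hidden(x):
    """Hidden activations: ReLU(W1 @ x)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W1]


def output(h):
    """Scalar output: W2 . h."""
    return sum(w * hi for w, hi in zip(W2, h))


def run(x, patch=None):
    """Forward pass. `patch` maps hidden-unit index -> value to overwrite."""
    h = hidden(x)
    if patch:
        for i, value in patch.items():
            h[i] = value
    return output(h)


clean_x = [2.0, 0.0]    # input where the model shows the behaviour of interest
corrupt_x = [0.0, 2.0]  # input where that behaviour is broken

clean_h = hidden(clean_x)     # cache the clean activations
clean_out = run(clean_x)
corrupt_out = run(corrupt_x)

# Patch one hidden unit at a time from the clean run into the corrupted run.
# effect = fraction of the clean-vs-corrupted output gap that the patch restores.
effects = {}
for i in range(len(clean_h)):
    patched_out = run(corrupt_x, patch={i: clean_h[i]})
    effects[i] = (patched_out - corrupt_out) / (clean_out - corrupt_out)

for i, effect in effects.items():
    print(f"hidden unit {i}: recovered fraction = {effect:.2f}")
```

A unit whose activation is the same on both inputs recovers nothing when patched, which is the point of the method: only the components that carry the difference between the clean and corrupted runs show up.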