Here, I will follow Neel Nanda’s tutorial on How to Become A Mechanistic Interpretability Researcher.
Keep in mind
Do not just read things. Mech interp is a fundamentally empirical science.
Stage 1: Learning the ropes
I already have a decent understanding of linear algebra and all the mathematical tools needed.
- I’ll start by reading about Transformers: Chapter 12 of the Understanding Deep Learning textbook.
- Read Anthropic’s paper, A Mathematical Framework for Transformer Circuits. (A bit too complicated for me right now.)
- Code a simple Transformer like GPT-2 from scratch. I used Neel Nanda’s Google Colab template for this task.
- Refer to Ferrando et al.
- Code activation patching yourself
- Linear probes
- Using sparse autoencoders (SAEs)
- Max Activating Dataset Examples
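
Of the techniques listed above, activation patching is the easiest to try on a toy model before touching a real Transformer. Below is a minimal sketch, assuming a made-up two-layer ReLU network with hand-picked weights (`W1`, `W2`, `forward`, and `patching_effects` are all hypothetical names for illustration, not from any library or from the tutorial). The idea is the same as on a real model: run a clean input and cache its activations, run a corrupted input, then overwrite one activation at a time with its clean value and measure how much of the clean output is restored.

```python
# Toy 2-layer network with hand-picked weights (hypothetical, illustration only).
W1 = [[1.0, -1.0, 0.5],
      [0.0, 2.0, -1.0]]   # 2 inputs x 3 hidden units
W2 = [1.5, -0.5, 2.0]     # 3 hidden units -> 1 scalar output


def forward(x, patch=None):
    """Run the toy model.

    `patch` optionally maps a hidden-unit index to a value that overwrites
    that unit's activation before the output layer is applied.
    Returns (output, hidden_activations).
    """
    hidden = []
    for j in range(3):
        pre = sum(x[i] * W1[i][j] for i in range(2))
        hidden.append(max(0.0, pre))  # ReLU
    if patch:
        for j, value in patch.items():
            hidden[j] = value
    output = sum(h * w for h, w in zip(hidden, W2))
    return output, hidden


clean_x = [1.0, 0.5]       # input where the model gives the "right" output
corrupted_x = [0.0, 0.5]   # same input with the first feature removed

# Step 1: run the clean input and cache its hidden activations.
clean_out, clean_hidden = forward(clean_x)
# Step 2: run the corrupted input to get the baseline output.
corrupted_out, _ = forward(corrupted_x)


def patching_effects():
    """Patch each hidden unit from the clean run into the corrupted run.

    Returns, per unit, the fraction of the clean-vs-corrupted output gap
    that patching that single unit restores (1.0 = fully restored).
    """
    effects = []
    for j in range(3):
        patched_out, _ = forward(corrupted_x, patch={j: clean_hidden[j]})
        effects.append((patched_out - corrupted_out) / (clean_out - corrupted_out))
    return effects


if __name__ == "__main__":
    print(patching_effects())  # [0.75, 0.25, 0.0]
```

Here hidden unit 0 accounts for most of the difference between the two runs, unit 1 for the rest, and unit 2 for none. The effects sum to 1 only because the output layer of this toy model is linear. On a real Transformer the same loop would run over layers, positions, or heads, using forward hooks instead of a `patch` argument.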