From The Age of Machine Intelligence

Goal of this page

The Transformer seems to be one of the biggest landmarks in ML evolution. It's what's powering OpenAI's GPT models etc.

Goal here is to present the smartest breadcrumb trail to an intuition-driven understanding of the theory behind Transformers.

Everyone's coming in with different prior knowledge, so "one-size-fits-all" is tricky.

(pi:) I'm gonna go with what's working for me. Writing this up as I build my own understanding. I left ML in '17, picking it up Jan '23.

Please do feel welcome to add/restructure. But let's be careful not to turn this into some useless encyclopedia with 100 hyperlinks. Goal is to present/optimize a learning-path.

I'll try to order the resources I found, so that we get an intuition/overview, then dig into the mechanism, and finally go code-level. Generally vids -> articles -> code-level & papers.


(.pdf) Attention is all you need

The original paper on arxiv. I don't recommend pushing thru this yet.

NOTE: The paper has associated which says it's out of date and current one is (pretty quiet since late 2021)

(vid 5m) 5 GoogleCloudTech -- Transformers, explained

3 key concepts: Positional Encoding, Attention, Self-Attention
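For a concrete feel of the first of these concepts, here is a minimal sketch of the sinusoidal positional encoding described in the original paper. The function name and the toy sizes are illustrative choices, not something taken from the video.

```python
import math

def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding from the original paper:
    PE[pos][2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos][2i+1] = cos(pos / 10000^(2i/d_model))
    """
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        # i walks over the even column indices, so i plays the role of 2i above.
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

# Toy sizes (assumed for illustration): 4 positions, 8-dimensional embeddings.
pe = positional_encoding(seq_len=4, d_model=8)
for row in pe:
    print([round(v, 3) for v in row])
```

Each row is added to the token embedding at that position, so the model can tell positions apart even though attention itself is order-blind.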

(vids: 4 x 10min) HeduAI

A masterpiece from HeduAI, complete with GameOfThrones refs. 4 short vids. Second one is where it kicks off.

(vid: 40m) Ark

Ark details the Self-Attention piece of the puzzle:


This doesn't sit on anything else. It's a fundamental component explained from the ground up.

Also he explains the sqrt(N) scaling-factor in a way nobody else does.
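A quick numerical sketch of why that scaling factor is there (this is not Ark's explanation, just a sanity check). It assumes the components of the query and key vectors are independent with zero mean and unit variance; the paper writes the factor as sqrt(d_k), where d_k is the key dimension. Under that assumption the raw dot product has variance close to d, and dividing by sqrt(d) brings it back to roughly 1, which keeps the softmax from saturating.

```python
import math
import random
import statistics

random.seed(0)

def dot_variance(d, n_samples=2000, scale=False):
    """Estimate the variance of q.k for random unit-variance vectors of size d."""
    dots = []
    for _ in range(n_samples):
        q = [random.gauss(0, 1) for _ in range(d)]
        k = [random.gauss(0, 1) for _ in range(d)]
        dot = sum(a * b for a, b in zip(q, k))
        # Optionally apply the 1/sqrt(d) scaling used in scaled dot-product attention.
        dots.append(dot / math.sqrt(d) if scale else dot)
    return statistics.variance(dots)

# Raw variance grows roughly like d; the scaled variance stays near 1.
for d in (4, 64, 256):
    print(d, round(dot_variance(d), 1), round(dot_variance(d, scale=True), 2))
```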

(vid: 4 x 10m) Rasa

Brilliant whiteboard expositions from Rasa. He builds upwards: (4 vids ~10min each)





Blog (15 min)

Jay Mody builds Attention block in numpy
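To give a flavour of what an attention block looks like in numpy, here is a minimal single-head sketch. It is not Jay Mody's code: the function names, the toy sizes, and the random weight matrices are all placeholders for illustration (a real model would learn w_q, w_k, w_v).

```python
import numpy as np

def softmax(x):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.
    x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])   # (seq_len, seq_len)
    weights = softmax(scores)                 # each row sums to 1
    return weights @ v, weights

# Toy sizes and random (untrained) weights, assumed for illustration only.
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 16, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = self_attention(x, w_q, w_k, w_v)
print(out.shape, weights.shape)
```

Row i of `weights` says how much token i draws from every other token when building its output.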

Justin Johnson -- uMichigan 22-lecture course DeepLearningForCV

Brilliant lecturer. Justin derives Transformer from RNNs. Watch lecture 12 (RNNs), then 13 (Attention -> Transformer). Might need an earlier vid on conv-nets too.

(vid: 13m) CodeEmporium

Tighter and more concise run-thru. I'm putting this lower on the list. By this point you should have dug into the concepts, and something like this will help pull it all together.

(BlogPost) Illustrated Transformer by Jay Alammar

This article drills down to the matmult level. If you're super-highFunctioning you might be able to code a Transformer from this.

I think this follows nicely as it's the first non-video resource. We're moving into reading blogposts (which your brain can do at its own speed, so benefit here). We'll eventually move into code/papers.

I still wouldn't rate my chances to build one, let alone TRAIN one, from resources up to this point.

Nobody seems to be talking about TRAINING these things. I think the exposition is so exhausting, there's no energy left for discussing training.

A main criticism here is that the presentation is kinda flat (same level of detail applied everywhere), e.g. it gives no attempt at intuition for KQV. So seek to form intuition elsewhere. I'm not even convinced Jay has it (from this article).

NOTE: He links a paper+code effort here:

TODO: ^ check this out

NOTE: I'm still in the process of re-reviewing these, summarizing sentiment, ordering, etc.

(vid: 30m) Yannic

Yannic's looking at the actual paper, so we're dropping down another level! He says some insightful things on KQV towards the end.

^ Now we're maybe ready to drop down to the code-level

(vid: 2h) Karpathy GPT from Scratch in code

Link to Karpathy's Discord in vid desc (strong community there). This vid is actually 7th in a `ZeroToHero` series.

(vid: 1h) Complete rundown

Despite very few views, this is AFAICS one of the clearest videos. It's a little woolly in a couple of places towards the end, but does a great job of covering the whole thing.

Probably not

AlgorithmicSimplicity (vid: 18m)

AlgorithmicSimplicity presents a ground-up simplified understanding of Transformers

He's building ground-up, starting with the concept of ConvNets. There is some insight here, and his approach is oblique to the standard approach. He doesn't even mention QKV. However, I'm not convinced this quite nails it.


This is the #1 hit on a Google search, but there are several wrong statements early on, so I hesitate to recommend this.

(Unordered but juicy)

... ^ medium article on Transformers

Complete series here: -- watch the prev one on RNNs (can skip LSTM at the end). -- He dips into iPython to explain concepts in PyTorch. Also he shows tensor dimensions. He links to and lucidrains is a BEAST -- about 20 xformer impls in one repo, and 230+ repos!

Not sure if juicy or not

Pascal Poupart (Waterloo Uni) CS480/680 Lecture 19: Attention and Transformer Networks

TODO: Someone can review this?


Aladdin Persson

Pytorch Transformers from Scratch (Attention is all you need) -- AladdinPersson codes up a xformer in an hour. Looks like he's reading from an offscreen resource, so not sure how valuable this is. <-- That's his GitHub. The source for this Transformer-tut is only 300 lines!

-> ^ He recommends this source material! TODO: Check it out.

-> -> ... which recommends this:

Misc -- beautiful visualizations of transformer internal state

(NOTE: Jointly Learning to Align and Translate with Transformer Models was recommended by Alexey Zaytsev on Yannic Kilcher's Discord, as an easier read than the original transformers paper)


14 Mar CodeEmporium TransformersFromScratch -- great schematic->codelevel

15 Mar Use GPT4, get it to be my tutor and ask me questions

15 Mar <-- TODO check this, looks SUPER-promising

27 Sep (2023) - Revisited Karpathy - Interesting take here. He's observing that an attention layer can be considered as a layer where the weighting is dynamic depending on the inputs.

- Geometric Algebra Transformers -> Alan Macdonald:


^ check his website too, e.g.:

30 Sep - < 3 min! This is the most insightful take on a Transformer I've yet seen. Basically that the Attention layer acts as a dynamic-weighted NN.
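To make the "dynamic weighting" idea concrete, here is a small numpy sketch (not taken from either video). A plain linear mixing layer applies the same fixed matrix to every input, whereas attention recomputes its mixing matrix from the input itself. To keep it short this assumes identity Q/K/V projections, i.e. no learned weight matrices, which is a simplification; the two toy inputs are made up.

```python
import numpy as np

def softmax(x):
    # Subtract the row max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_mixing_weights(x):
    """Mixing matrix computed from the input itself (identity Q/K projections)."""
    return softmax(x @ x.T / np.sqrt(x.shape[-1]))

# A fixed mixing matrix: identical for every input, like ordinary layer weights.
static_w = np.full((3, 3), 1 / 3)

# Two made-up inputs: 3 tokens each, 2 features per token.
x_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
x_b = np.array([[2.0, 0.0], [2.0, 0.1], [0.0, 3.0]])

w_a = attention_mixing_weights(x_a)
w_b = attention_mixing_weights(x_b)

out_static_a, out_static_b = static_w @ x_a, static_w @ x_b   # same weights both times
out_attn_a, out_attn_b = w_a @ x_a, w_b @ x_b                 # weights differ per input

print(np.round(w_a, 2))
print(np.round(w_b, 2))
```

The two printed matrices differ even though no parameter changed between the calls: the "weights" doing the mixing are a function of the data.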