Goal of this page
The Transformer seems to be one of the biggest landmarks in ML's evolution. It's the architecture powering ChatGPT etc.
The goal here is to present the smartest breadcrumb trail to an intuition-driven understanding of the theory behind Transformers.
Everyone comes in with different prior knowledge, so "one-size-fits-all" is tricky.
(pi:) I'm going to go with what's working for me, writing this up as I build my own understanding. I left ML in '17 and picked it back up in Jan '23.
Please do feel welcome to add/restructure. But let's be careful not to turn this into a useless encyclopedia with 100 hyperlinks. The goal is to present/optimize a learning path.
I'll try to order the resources I found so that we first get an intuition/overview, then dig into the mechanism, and finally go code-level. Generally: vids -> articles -> code & papers.
The original paper ("Attention Is All You Need") on arXiv. I don't recommend pushing thru this yet.
3 key concepts: Positional Encoding, Attention, Self-Attention
A masterpiece from HeduAI, complete with Game of Thrones refs. 4 short vids; the second one is where it kicks off.
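To pin down the first of those three concepts in code: here's a minimal numpy sketch of the sinusoidal positional encoding from the original paper (my own code, not from the vids):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal PE from "Attention Is All You Need":
    PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    Each row is a unique "timestamp" vector added to the token embedding."""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]          # (1, d_model/2) even dims
    angles = pos / np.power(10000.0, i / d_model)  # (seq_len, d_model/2)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even columns: sin
    pe[:, 1::2] = np.cos(angles)                   # odd columns: cos
    return pe

pe = positional_encoding(50, 16)
```

The nice property (covered well in the vids): any fixed offset between positions is a linear function of the encoding, which is what lets attention reason about relative position.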
Ark details the Self-Attention piece of the puzzle:
This doesn't sit on anything else. It's a fundamental component explained from the ground up.
He also explains the sqrt(d_k) scaling factor in a way nobody else does.
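For reference, here's where that scaling factor sits, as a minimal numpy sketch of scaled dot-product attention (mine, not from the vid):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V.  For unit-variance q, k components,
    the dot product q.k has variance d_k, so dividing by sqrt(d_k) keeps
    the logits at unit variance and stops softmax saturating as d_k grows."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) logits
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)               # row-wise softmax
    return w @ V, w

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 64)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
```

Without the divide, large d_k pushes the softmax into near-one-hot territory and gradients vanish.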
(vid: 4 x ~10m) Rasa
Brilliant whiteboard expositions. He builds upwards across the 4 vids:
Blog (15 min)
Justin Johnson -- UMichigan's 22-lecture course, Deep Learning for Computer Vision
Brilliant lecturer. Justin derives the Transformer from RNNs. Watch lecture 12 (RNNs), then 13 (Attention -> Transformer). You might need an earlier vid on ConvNets too.
A tighter, more concise run-thru. I'm putting this lower on the list: by this point you should have dug into the concepts, and something like this will help pull it all together.
This article drills down to the matmul level. If you're super high-functioning, you might be able to code a Transformer from this alone.
I think this follows nicely, as it's the first non-video resource. We're moving into reading blog posts (which your brain can process at its own speed -- a benefit). We'll eventually move into code/papers.
I still wouldn't rate my chances of building one, let alone TRAINING one, from the resources up to this point.
Nobody seems to be talking about TRAINING these things. I suspect the exposition is so exhausting, there's no energy left for discussing training.
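For the record, the training objective itself is plain supervised learning: predict every next token in parallel, cross-entropy against the sequence shifted by one. A minimal numpy sketch of just the loss (model omitted; shapes and names are illustrative):

```python
import numpy as np

def next_token_loss(logits, tokens):
    """Cross-entropy between logits[t] and tokens[t+1] -- at every position
    the model is trained to predict the NEXT token. The causal mask inside
    the Transformer is what keeps position t from peeking at t+1."""
    targets = tokens[1:]                 # shift left by one
    logits = logits[:-1]                 # final position has no target
    logits = logits - logits.max(axis=-1, keepdims=True)   # stable log-softmax
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

vocab, seq = 10, 6
rng = np.random.default_rng(0)
loss = next_token_loss(rng.standard_normal((seq, vocab)),
                       rng.integers(0, vocab, seq))
```

From there it's the usual loop: backprop the loss, step an optimizer (Adam with warmup in the paper), repeat over batches.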
A main criticism: the presentation is kinda flat (the same level of detail applied everywhere), and it makes no attempt at intuition for KQV. So seek that intuition elsewhere. From this article alone, I'm not even convinced Jay has it.
NOTE: He links a paper+code effort here: http://nlp.seas.harvard.edu/annotated-transformer/
TODO: ^ check this out
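Since the KQV intuition is the gap here, this is the whole single-head self-attention step at the matmul level, with the usual folk intuition as comments (my sketch, not from the article):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_k = 8, 8
X = rng.standard_normal((5, d_model))      # 5 token embeddings

# Q, K, V are just three learned linear maps of the SAME input.
# Rough intuition: Q = "what am I looking for", K = "what do I contain",
# V = "what do I pass along if attended to".
W_q = rng.standard_normal((d_model, d_k))
W_k = rng.standard_normal((d_model, d_k))
W_v = rng.standard_normal((d_model, d_k))

Q, K, V = X @ W_q, X @ W_k, X @ W_v
scores = Q @ K.T / np.sqrt(d_k)            # token-to-token relevance logits
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)         # each row: an attention distribution
out = A @ V                                # each token: weighted mix of values
```

Four matmuls and a softmax -- that's the whole mechanism; everything else is plumbing around it.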
NOTE: I'm still in the process of re-reviewing these, summarizing sentiment, ordering, etc.
Yannic looks at the actual paper, so we're dropping down another level! He says some insightful things about KQV towards the end.
^ Now we're maybe ready to drop down to the code-level
Link to Karpathy's Discord in the vid description (strong community there). This vid is actually the 7th in his `ZeroToHero` series.
Despite very few views, this is AFAICS one of the clearest videos. It gets a little woolly in a couple of places towards the end, but does a great job of covering the whole thing.
AlgorithmicSimplicity presents a ground-up, simplified understanding of Transformers
He builds from the ground up, starting with the concept of ConvNets. There is some insight here, and his approach is oblique to the standard one. He doesn't even mention QKV. However, I'm not convinced this quite nails it.
This is the #1 hit on a Google search, but there are several wrong statements early on, so I hesitate to recommend it.
(Unordered but juicy)
https://towardsdatascience.com/transformers-explained-visually-not-just-how-but-why-they-work-so-well-d840bd61a9d3 -- Medium article on Transformers
Complete series here: https://www.youtube.com/playlist?list=PL5-TkQAfAZFbzxjBHtzdVCWE0Zbhomg7r -- watch the previous one on RNNs first (you can skip the LSTM part at the end).
https://www.youtube.com/watch?v=ovB0ddFtzzA&ab_channel=mildlyoverfitted -- He dips into IPython to explain concepts in PyTorch, and shows tensor dimensions. He links to https://github.com/lucidrains/vit-pytorch -- and lucidrains is a BEAST: about 20 Transformer implementations in one repo, and 230+ repos!
Not sure if juicy or not
TODO: Someone can review this?
PyTorch Transformers from Scratch (Attention Is All You Need) -- AladdinPersson codes up a Transformer in an hour. It looks like he's reading from an offscreen resource, so not sure how valuable this is.
https://github.com/aladdinpersson/Machine-Learning-Collection <-- That's his GitHub. The source for this Transformer tut is only ~300 lines!
-> https://peterbloem.nl/blog/transformers ^ He recommends this source material! TODO: Check it out.
-> -> ... which recommends this: http://nlp.seas.harvard.edu/annotated-transformer/
https://github.com/jessevig/bertviz -- beautiful visualizations of transformer internal state
(NOTE: "Jointly Learning to Align and Translate with Transformer Models" was recommended by Alexey Zaytsev on Yannic Kilcher's Discord as an easier read than the original Transformer paper)
14 Mar [CodeEmporium TransformersFromScratch | https://www.youtube.com/playlist?list=PLTl9hO2Oobd97qfWC40gOSU8C0iu0m2l4] -- great schematic -> code-level walkthrough
15 Mar Use GPT4, get it to be my tutor and ask me questions
15 Mar https://www.youtube.com/watch?v=fEVyfT-gLqQ&ab_channel=AlfredoCanziani <-- TODO check this, looks SUPER-promising
27 Sep (2023) - Revisited Karpathy - https://www.youtube.com/watch?v=qAb581l7lOc&ab_channel=ArtoftheProblem -- interesting take here: he observes that an attention layer can be seen as a layer whose weighting is dynamic, depending on the inputs.
- Geometric Algebra Transformers https://arxiv.org/abs/2305.18415 https://www.youtube.com/watch?v=nPIRL-c88_E&ab_channel=SAIConference -> Alan Macdonald:
^ check his website too, e.g.: http://www.faculty.luther.edu/~macdonal/GAConstruct/GAConstruct.html
30 Sep - https://www.youtube.com/watch?v=qAb581l7lOc&ab_channel=ArtoftheProblem < 3 min! This is the most insightful take on a Transformer I've yet seen: basically, the Attention layer acts as a dynamically-weighted NN.
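That "dynamic weights" framing is easy to demo: an ordinary linear layer applies one fixed matrix to every input, while attention mixes the tokens with a matrix that is recomputed from the input itself. A toy numpy sketch (simplified to Q = K = X; my code, not from the vid):

```python
import numpy as np

rng = np.random.default_rng(0)

def attention_mixing(X):
    """Self-attention mixing matrix with Q = K = X for simplicity.
    A is a function of the input -- the 'dynamic weights' view."""
    scores = X @ X.T / np.sqrt(X.shape[-1])
    A = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return A / A.sum(axis=-1, keepdims=True)

W_fixed = rng.standard_normal((4, 4))   # an ordinary layer's weights: frozen
X1, X2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 4))

# The fixed layer applies the SAME matrix to any input...
y1, y2 = X1 @ W_fixed, X2 @ W_fixed
# ...while attention's mixing matrix changes with the input:
A1, A2 = attention_mixing(X1), attention_mixing(X2)
```

Same "layer", different inputs, different mixing matrices -- which is exactly the observation in the vid.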