Transformers
Goal of this page
The Transformer seems to be one of the biggest landmarks in ML evolution. It's what's powering ChatGPT etc.
Goal here is to present the smartest breadcrumb trail to an intuition-driven understanding of the theory behind Transformers.
Everyone's coming in with different prior knowledge, so "one-size-fits-all" is tricky.
(pi:) I'm gonna go with what's working for me. Writing this up as I build my own understanding. I left ML in '17, picked it back up Jan '23.
Please do feel welcome to add/restructure. But let's be careful not to turn this into some useless encyclopedia with 100 hyperlinks. The goal is to present/optimize a learning path.
I'll try to order the resources I found so that we first get an intuition/overview, then dig into the mechanism, and finally go code-level. Generally vids -> articles -> code-level & papers.
Resources
(.pdf) Attention Is All You Need
The original paper on arxiv. I don't recommend pushing thru this yet.
NOTE: The paper has an associated repo, https://github.com/tensorflow/tensor2tensor, which says it's out of date; the current one is https://github.com/google/trax (pretty quiet since late 2021)
(vid: 5m) GoogleCloudTech -- Transformers, explained
3 key concepts: Positional Encoding, Attention, Self-Attention
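To make the first of those concepts concrete, here's a minimal numpy sketch of the sinusoidal positional encoding from the original paper (my own sketch, not from the video):

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encoding from 'Attention Is All You Need'.
    Each position gets a unique mix of sines/cosines at geometric frequencies.
    (d_model assumed even.)"""
    pos = np.arange(seq_len)[:, None]              # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    freq = 1.0 / (10000 ** (2 * i / d_model))      # frequency ladder
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(pos * freq)               # even dims
    pe[:, 1::2] = np.cos(pos * freq)               # odd dims
    return pe

print(positional_encoding(4, 8).round(2))
```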
(vids: 4 x 10min) HeduAI
A masterpiece from HeduAI, complete with GameOfThrones refs. 4 short vids. Second one is where it kicks off.
(vid: 40m) Ark
Ark details the Self-Attention piece of the puzzle:
This doesn't sit on anything else. It's a fundamental component explained from the ground up.
Also he explains the sqrt(d_k) scaling factor in a way nobody else does.
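The gist of that scaling argument (my own sanity-check in numpy, not Ark's code): dot products of random d-dimensional vectors have a standard deviation of about sqrt(d), so dividing scores by sqrt(d_k) stops the softmax from saturating as dimensions grow:

```python
import numpy as np

rng = np.random.default_rng(0)
for d in (4, 64, 1024):
    q = rng.standard_normal((10000, d))
    k = rng.standard_normal((10000, d))
    dots = (q * k).sum(axis=1)      # 10000 sample dot products q.k
    # std grows like sqrt(d); unscaled, large d pushes softmax into saturation
    print(f"d={d:5d}  std(q.k)={dots.std():7.2f}  sqrt(d)={np.sqrt(d):7.2f}")
```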
(vid: 4 x 10m) Rasa
Brilliant whiteboard expositions from Rasa. He builds upwards: (4 vids ~10min each)
Blog (15 min)
Jay Mody builds an Attention block in numpy
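The essence of what he builds is roughly this (a minimal sketch of the standard formulation, not Jay's exact code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))   # max-subtraction for stability
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)   # (n_q, n_k): each query scored against each key
    return softmax(scores) @ v        # weighted average of the values

q = np.random.randn(3, 8)    # 3 queries, dim 8
k = np.random.randn(5, 8)    # 5 keys, dim 8
v = np.random.randn(5, 16)   # 5 values, dim 16
print(attention(q, k, v).shape)   # -> (3, 16)
```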
Justin Johnson -- uMichigan 22-lecture course DeepLearningForCV
Brilliant lecturer. Justin derives Transformer from RNNs. Watch lecture 12 (RNNs), then 13 (Attention -> Transformer). Might need an earlier vid on conv-nets too.
(vid: 13m) CodeEmporium
A tighter, more concise run-thru. I'm putting this lower on the list: by this point you should have dug into the concepts, and something like this will help pull it all together.
(BlogPost) Illustrated Transformer by Jay Alammar
This article drills down to the matmul level. If you're super high-functioning you might be able to code a Transformer from this.
I think this follows nicely as it's the first non-video resource. We're moving into reading blog posts (which your brain can process at its own speed, so there's a benefit here). We'll eventually move into code/papers.
I still wouldn't rate my chances of building one, let alone TRAINING one, from the resources up to this point.
Nobody seems to be talking about TRAINING these things. I think the exposition is so exhausting that there's no energy left for discussing training.
A main criticism here is that the presentation is kinda flat (same level of detail applied everywhere), e.g. it makes no attempt at intuition for KQV. So seek to form intuition elsewhere (mechanically, KQV is just three linear projections of the input -- see the sketch after this entry). I'm not even convinced Jay has it (judging from this article).
NOTE: He links a paper+code effort here: http://nlp.seas.harvard.edu/annotated-transformer/
TODO: ^ check this out
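For what it's worth, the mechanical (if not the intuitive) side of KQV is tiny: Q, K and V are just three learned linear projections of the same input. A numpy sketch (weights random here; they'd be trained in practice):

```python
import numpy as np

d_model, d_k = 16, 8
x = np.random.randn(5, d_model)    # 5 token embeddings

# three learned projections of the SAME input (random here, trained in practice)
W_q = np.random.randn(d_model, d_k)
W_k = np.random.randn(d_model, d_k)
W_v = np.random.randn(d_model, d_k)
Q, K, V = x @ W_q, x @ W_k, x @ W_v

scores = Q @ K.T / np.sqrt(d_k)
scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = scores / scores.sum(axis=-1, keepdims=True)   # each row sums to 1
out = weights @ V                                       # (5, d_k)
```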
NOTE: I'm still in the process of re-reviewing these, summarizing sentiment, ordering, etc.
(vid: 30m) Yannic
Yannic works through the actual paper, so we're dropping down another level! He says some insightful things about KQV towards the end.
https://www.tensorflow.org/text/tutorials/transformer
^ Now we're maybe ready to drop down to the code-level
(vid: 2h) Karpathy GPT from Scratch in code
Link to Karpathy's Discord in the vid description (strong community there). This vid is actually the 7th in his `ZeroToHero` series.
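One trick from that video worth previewing: a GPT-style decoder masks out future positions so each token can only attend backwards. A rough numpy sketch of the masking idea (not Karpathy's actual code, which is in PyTorch):

```python
import numpy as np

T = 5
scores = np.random.randn(T, T)                 # raw attention scores for T tokens
mask = np.tril(np.ones((T, T)))                # lower-triangular: 1 = may attend
scores = np.where(mask == 1, scores, -np.inf)  # hide the future
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
print(weights.round(2))   # row i only puts weight on positions <= i
```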
(vid: 1h) Complete rundown
Despite very few views, this is AFAICS one of the clearest videos. It gets a little woolly in a couple of places towards the end, but does a great job of covering the whole thing.
Probably not
AlgorithmicSimplicity (vid: 18m)
AlgorithmicSimplicity presents a ground-up simplified understanding of Transformers
He builds from the ground up, starting with the concept of ConvNets. There is some insight here, and his approach is oblique to the standard one. He doesn't even mention QKV. However, I'm not convinced this quite nails it.
TheA.I.Hacker-MichaelPhi
This is the #1 hit on a Google search, but there are several wrong statements early on, so I hesitate to recommend this.
(Unordered but juicy)
...
https://towardsdatascience.com/transformers-explained-visually-not-just-how-but-why-they-work-so-well-d840bd61a9d3 -- Medium article on Transformers
https://jordanlazzaro.github.io/posts/transformers-in-a-nutshell/
https://towardsdatascience.com/illustrated-self-attention-2d627e33b20a
Complete series here: https://www.youtube.com/playlist?list=PL5-TkQAfAZFbzxjBHtzdVCWE0Zbhomg7r -- watch the prev one on RNNs (can skip LSTM at the end).
https://www.youtube.com/watch?v=ovB0ddFtzzA&ab_channel=mildlyoverfitted -- He dips into iPython to explain concepts in PyTorch. Also he shows tensor dimensions. He links to https://github.com/lucidrains/vit-pytorch and lucidrains is a BEAST -- about 20 xformer impls in one repo, and 230+ repos!
Not sure if juicy or not
Pascal Poupart (Waterloo Uni) CS480/680 Lecture 19: Attention and Transformer Networks
TODO: Can someone review this?
...
https://neptune.ai/blog/comprehensive-guide-to-transformers
Aladdin Persson
Pytorch Transformers from Scratch (Attention is all you need) -- AladdinPersson codes up a xformer in an hour. Looks like he's reading from an offscreen resource, so not sure how valuable this is.
https://github.com/aladdinpersson/Machine-Learning-Collection <-- That's his GitHub. The source for this Transformer-tut is only 300 lines!
-> He recommends this source material: https://peterbloem.nl/blog/transformers -- TODO: Check it out.
-> -> ... which recommends this: http://nlp.seas.harvard.edu/annotated-transformer/
Misc
https://en.wikipedia.org/wiki/Transformer_(machine_learning_model)
https://en.wikipedia.org/wiki/Attention_(machine_learning)
https://en.wikipedia.org/wiki/Word_embedding
https://github.com/jessevig/bertviz -- beautiful visualizations of transformer internal state
(NOTE: "Jointly Learning to Align and Translate with Transformer Models" was recommended by Alexey Zaytsev on Yannic Kilcher's Discord, as an easier read than the original Transformers paper)
Diary
14 Mar [CodeEmporium TransformersFromScratch | https://www.youtube.com/playlist?list=PLTl9hO2Oobd97qfWC40gOSU8C0iu0m2l4] -- great schematic -> code-level walkthrough
15 Mar Use GPT4, get it to be my tutor and ask me questions
15 Mar https://www.youtube.com/watch?v=fEVyfT-gLqQ&ab_channel=AlfredoCanziani <-- TODO check this, looks SUPER-promising
27 Sep (2023) - Revisited Karpathy. Also https://www.youtube.com/watch?v=qAb581l7lOc&ab_channel=ArtoftheProblem -- interesting take here: he observes that an attention layer can be considered a layer where the weighting is dynamic, depending on the inputs.
- Geometric Algebra Transformers https://arxiv.org/abs/2305.18415 https://www.youtube.com/watch?v=nPIRL-c88_E&ab_channel=SAIConference -> Alan Macdonald:
- https://www.youtube.com/watch?v=srwoPQfWWS8&list=PLLvlxwbzkr7igd6bL7959WWE7XInCCevt&ab_channel=AlanMacdonald
^ check his website too, e.g.: http://www.faculty.luther.edu/~macdonal/GAConstruct/GAConstruct.html
30 Sep - https://www.youtube.com/watch?v=qAb581l7lOc&ab_channel=ArtoftheProblem < 3 min! This is the most insightful take on a Transformer I've yet seen: basically, the Attention layer acts as a dynamically-weighted NN.
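To spell that observation out (my paraphrase in numpy, not from the video): a dense layer applies one fixed matrix W to every input, whereas attention computes its mixing matrix from the input itself:

```python
import numpy as np

x = np.random.randn(5, 8)    # 5 tokens, dim 8

# dense layer: the same fixed W is applied no matter what x is
W = np.random.randn(8, 8)
dense_out = x @ W

# attention: the mixing matrix A is computed FROM x, so it changes per input
A = np.exp(x @ x.T / np.sqrt(8))
A /= A.sum(axis=-1, keepdims=True)
attn_out = A @ x             # a dynamically-weighted mix of the tokens
```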