This post lists Ilya Sutskever’s ML study guide. For my own ML study guide, see this post.
When Ilya gave this list to John Carmack, he reportedly said, ‘If you really learn all of these, you’ll know 90% of what matters today.’
There are a few versions of this list online, but the most authoritative I found was from Andrew Carr, an ex-OpenAI employee who found the list in OpenAI’s onboarding docs (source). The most notable difference in this version is the source for Kolmogorov complexity, which is a chapter from a book.
- The Annotated Transformer. Sasha Rush, et al. Companion article to “Attention is All You Need”. [Blog] [GitHub]
- The First Law of Complexodynamics. Scott Aaronson. [Blog]
- The Unreasonable Effectiveness of Recurrent Neural Networks. Andrej Karpathy. [Blog]
- Understanding LSTM Networks. Christopher Olah. [Blog]
- Recurrent Neural Network Regularization. Wojciech Zaremba, et al. [ArXiv]
- Keeping Neural Networks Simple by Minimizing the Description Length of the Weights. Geoffrey E. Hinton and Drew van Camp. [PDF]
- Pointer Networks. Oriol Vinyals, et al. [ArXiv]
- ImageNet Classification with Deep Convolutional Neural Networks. Alex Krizhevsky, et al. [PDF]
- Order Matters: Sequence to sequence for sets. Oriol Vinyals, et al. [ArXiv]
- GPipe: Easy Scaling with Micro-Batch Pipeline Parallelism. Yanping Huang, et al. [ArXiv]
- Deep Residual Learning for Image Recognition. Kaiming He, et al. [ArXiv]
- Multi-Scale Context Aggregation by Dilated Convolutions. Fisher Yu and Vladlen Koltun. [ArXiv]
- Neural Message Passing for Quantum Chemistry. Justin Gilmer, et al. [ArXiv]
- Attention Is All You Need. Ashish Vaswani, et al. [ArXiv]
- Neural Machine Translation by Jointly Learning to Align and Translate. Dzmitry Bahdanau, et al. [ArXiv]
- Identity Mappings in Deep Residual Networks. Kaiming He, et al. [ArXiv]
- A simple neural network module for relational reasoning. Adam Santoro, et al. [ArXiv]
- Variational Lossy Autoencoder. Xi Chen, et al. [ArXiv]
- Relational recurrent neural networks. Adam Santoro, et al. [ArXiv]
- Quantifying the Rise and Fall of Complexity in Closed Systems: The Coffee Automaton. Scott Aaronson, et al. [ArXiv]
- Neural Turing Machines. Alex Graves, et al. [ArXiv]
- Deep Speech 2: End-to-End Speech Recognition in English and Mandarin. Dario Amodei, et al. [ArXiv]
- Scaling Laws for Neural Language Models. Jared Kaplan, et al. [ArXiv]
- A Tutorial Introduction to the Minimum Description Length Principle. Peter Grünwald. [ArXiv]
- Machine Super Intelligence. Shane Legg. [Thesis PDF]
- CS231n: Convolutional Neural Networks for Visual Recognition. [Course]
- Elements of Information Theory, 2nd Edition, Ch. 14 Kolmogorov Complexity. Thomas M. Cover and Joy A. Thomas. [Book]
Note that the list is missing additional papers on meta-learning. If anyone finds them, please let me know so that I can add them. In the meantime, here are two meta-learning papers I suggest:
- Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments. Al-Shedivat et al. [ArXiv]
- Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. Chelsea Finn, Pieter Abbeel, Sergey Levine. [Project page]
Update 2024-12-05: I found an interesting blog post by Taro Langner, written a few weeks before mine, in which he speculates about the missing meta-learning papers. I list his suggestions below: