Caio Filippo Corro

Advice for machine learning students

Work in progress: I will try to regulary update this page with more information. Please, send me an e-mail if you feel like something is missing.

This page contains my personnal advice to (future) students in our Artificial Intelligence Master at Paris-Saclay, but I hope it will also be useful for other students. It is obviously highly biased toward sub-domains that I know, for example I will cover Natural Language Processing but not Computer Vision. This guide is not intended to be exhaustive but instead to focus on the smallest possible list of recommendations so that you don’t feel overwhelmed.


  1. Background: topics you are supposed to know — don’t worry if you have gaps in this list, the goal of this page is to help you identify what you need to focus on! Take your time in the first months of the master to work on subjects you don’t know yet;
  2. Helpful resources
  3. Searching and applying for an academic internship or PhD
  4. Overview of some ML for NLP research topics: a few references to give an overview of important topics in Machine Learning for Natural Language Processing for students who want to start a PhD.

Importantly, and I cannot stress it enough: do not pay for online courses. Do not pay for online courses. Do not pay for online courses. Do not pay for online courses. Do not pay for online courses. Do not pay for online courses. Do not pay for online courses. Do not pay for online courses. DO NOT PAY FOR ONLINE COURSES.

If you are a (future) student in our master and you fail to find one of the book/article listed here, please contact me by e-mail.



It is really important that you learn to formalize your ideas using math and pseudo-code. Pictures are good for the intuition but:

Of course there are many counter examples, e.g. probabilistic graphical models which have a well defined “graphical semantic”.

Let’s take a concrete exemple: how to present a multilayer perceptron (MLP)? (do not worry if you don’t know what a MLP is, you will learn that in the deep learning course) This simple and common deep learning architecture is often presented using this kind of figure:

Illustration of a neural network

Unfortunately, this picture doesn’t say much and it is not precise. I doesn’t fully specify what is the computation done by the model. A good way to present it would be:

Let $x \in \mathbb R^n$ be an input vector. A multilayer perceptron with a single hidden layer computes an output logit vector $w \in \mathbb R^m$ as follows:

$z = \tanh(A^{(1)} x + b^{(1)})$
$w = A^{(2)} z + b^{(2)}$


  • the hyperbolic tangent $\tanh$ is applied element wise,
  • $z \in \mathbb R^h$ is a hidden representation of size h,
  • $A^{(1)} \in \mathbb R^{h \times n}$ and $b^{(1)} \in \mathbb R^h$ are the parameter of the hidden projection,
  • $A^{(2)} \in \mathbb R^{m \times h}$ and $b^{(2)} \in \mathbb R^m$ are the parameter of the output projection.

If you are not familliar with matrix multiplication it can be helpful to draw the matrices to keep track of dimensions — do it on paper when you code so you are not lost!

Matrix multiplication illustration

This kind of drawing may become more important when you will need to rely on implicit broadcasting.


The main language you will need during this master is Python, which is really popular in machine learning and datascience in general. There are many other popular languages including C++, Rust and Julia, but you will have plenty of time to learn them when you will need to use them. For example, C++ is still used to design specific components in machine learning architectures that require speed and cannot be parallelized easily on GPU. However, in general and in the context of this master, focus on Python: learn how to write programs in Python, how it differs from languages you already know (e.g. loop definition in Python is unusual!) and play with it. It’s a computer science master, therefore strong programming skills is a requirement. I strongly advise to NOT spend time learning how to use fancy new languages or libraries. Use the basic libraries that works well and are well documented.

There are two important libraries that you will need to use:

Play with them, train yourself to write programs that manipulate vectors/matrices and display graphs. During the master you will probably use Pytorch and Pandas, but start with these two first and once you are proficient with them it will be easy for you to use the others.

A commonly used tool to prototype Python scripts is Jupyter. Install it and use it: most of your lab exercises will be based on Jupyter notebooks.

Linear algebra

Linear algebra is a recurring topic in machine learning and datascience: it will appear everywhere. Yes, everywhere. Having a good understanding of linear algebra will help to understand many algorithms and analysis techniques really quickly. You will save a lot of time right now and later. If you did not study it as an undegrad or if you need to refresh your memory, I can only advise you to watch and study Gilbert Strang’s course which available online for free:

My advice is to study topics covered until (and including) lecture 21. However, it is long and you may not have time to study everything. Don’t despair if you can’t do it all, watching only a few of the videos will already do you a big favor (but watch them in order!).


Alex Tsun uploaded a set of 5 minutes videos on probability that can be useful to understand the main concepts:


In computer science, we formalize and expose our work using mathematics and pseudo-code. You must learn to write reports than contains math and (pseudo-)code nicely, do not display screenshots from other people reports. The common tool for this is latex, there are many software available, I recommend:

The Overleaf website contains a nice Latex tutorial and an introduction on writing math in Latex. Write you tex files like you write code:

An important feature of Latex is that it takes care of the layout for you. Do not use \ \ to break lines or create a new paragraph: just leave one empty line for a “short paragraph break” and \paragraph{} for a “long paragraph break”.

A few tips:

It is important to write clearly, here are a few useful links:

Helpful resources

Here is a list of books you can use as a resource when you are searching for a specific topic related to machine learning:

We usually accept reports both in french and english (but check with the course teacher first!). However, if you are in a group with non-french speakers, you don’t really have a choice. The following websites can help you to improve your English writing:

Searching and applying for an academic internship or PhD

Open positions are usually advertised on mailing lists. You should register to them, note however that you will receive many emails (use Thunderbird to deal with your emails instead of a web client).

A few websites you can also check to find PhD positions:

Attach a CV when contacting a researcher. A few tips:

It is also best to have a public website with the same information. You can do something quickly with a static website generator like Hugo or Pelican. If you don’t want to self-host your website, you can also use free services like Cygale or Github pages.

Overview of some ML for NLP research topics

This section is intended for students who aim for a PhD after the master. Scientific research is exciting and fascinating, especially in machine learning which is a popular field these days. However, it is difficult to have a clear view of what are interesting and open research problems. I will try to list a very limited number of them that are highly biased toward what I am personnaly interested in. It is important to understand that “solving task X with machine learning” is vague and this kind of objective may limit the area in which you will contribute as a computer science researcher. This is especially true in natural language processing where obtaining good results for well studied languages is straightforward with large pretrained models like Bert and GPT. But first, the most important advice is the following:

During your PhD, work seriously but do not overwork — beware of burn-out, take care of yourself, seek help from your advisor and colleagues if you need.

I recommend reading the How to Find Research Problems page by Jason Eisner and references therein.

Structured prediction

Prediction problems you will study during the master will mainly be either (binary) classification and regression. However, there are many other problems including structured prediction problems where:

One of the most basic example in natural language processing is syntactic dependency parsing. Assume you have an input sentence of $n$ words. The goal is to predict the bilexical dependencies between words which can be represented as a spanning arborescence in a graph.

Dependency analysis of the sentence "the dog saw the cat"

A graph based parser for this problem works as follow:

  1. Build a graph where each word in the sentence is represented as a vertex;
  2. use a neural network to weight each possible arc (i.e. dependency between two words);
  3. compute the spanning arborescence of maximum weight.

Other problems include summing over all possible spanning arborescences which is required for training via a maximum likelihood loss. As such, structured prediction is a research field that includes many computer science topics, including graph theory, formal grammars1 and (combinatorial) optimization:

If you are interested in structured prediction, Jason Eisner’s research summary gives a good view of the field.

Designing neural architectures

In order to train and use a neural network, one must first design its architecture which can be motivated in many ways. For example, LSTMs have been designed to prevent the vanishing gradient problem and Transformer for their ability to be efficiently parallelized on GPU. However, they can also be designed to inject prior knowledge about the task so that they are more data efficient or interepretable:

Expressivity of neural architectures

Computer scientists have developed formal tools to classify the expressivity of formal languages and the hardness of problems, see for example wikipedia pages of Chomsky hierarchy and mildly context-sensitive grammar formalism for the case of formal grammars and complexity class for algorithms. Although the universal approximation theorem states that any function can be approximated by a neural network under certain conditions, researcher have been interested in better characterizations of architectures used in practice (e.g. with fixed width) and using other theoretical tools:

Efficient computation

Computational power is expensive. A good GPU for machine learning cost at least 5.000 euros, but a reasonable one for today’s neural architectures would be closer to 10.000 euros (e.g. NVidia v100 with 32Go of RAM). State-of-the-art models are usually trained on very large datasets with several GPUs or even TPUs. This is not sustainable and prevent the use of neural networks on small embedded devices. It is therefore important to develop more efficient models both in terms of memory and computational time. If you are interested in this subject, check Kenneth Heafield’s webpage.

Neural networks are usually implemented with autograd libraries like pytorch that take care of the computation of the gradient automatically. In theory, the backpropagation algorithm as a time complexity of the same order as the forward pass that was used to build the computational graph (more precisely: at most 5 times the cost of evaluating the forward function).25 However, in some specific cases this can be done more efficiently, for example:

Other methods to improve computation time include, among others:

Learning algorithms beyond gradient descent

First order optimization methods are popular in machine learning as they are scalable.

Natural Language Processing for low resource languages

Although many modern neural methods seem to work reasonably well, most of the ~6000-7000 languages spoken today lack resources for training. Therefore, there is a growing interest in natural language processing for low resource languages. As it is too difficult to give an overview of the field in a few lines, I advise you to check the following reviews:

Text generation

Text generation is an important problem in natural language processing that appears in many problems: machine translation, dialogue systems, paraphrase generation, text simplification, etc. The most common approach is based on autoregressive models where a sentence in generated one word after the other in sequential order. Unfortunately, computing the most probable output in this configuration is intractable, leading to the use of inexact greedy methods like beam search. A few research directions:

  1. Parsing Beyond Context-Free Grammars (Laura Kallmeyer) [return]
  2. The Simple Truth about Dependency and Phrase Structure Representations: An Opinion Piece (Owen Rambow, 2010) [return]
  3. A Minimal Span-Based Neural Constituency Parser (Stern, Andreas and Klein, 2017) [return]
  4. Span-based discontinuous constituency parsing: a family of exact chart-based algorithms with time complexities from $\mathcal O(n^6)$ down to $\mathcal O(n^3)$ (Corro, 2020) [return]
  5. Dual Decomposition for Parsing with Non-Projective Head Automata (Koo, Rush, Collins, Jaakkola and Sontag, 2010) [return]
  6. Efficient Discontinuous Phrase-Structure Parsing via the Generalized Maximum Spanning Arborescence (Corro, Le Roux and Lacroix, 2017) [return]
  7. Large Margin Methods for Structured and Interdependent Output Variables (Tsochantaridis, Joachims, Hofmann and Altun, 2005) [return]
  8. Graphical Models, Exponential Families, andVariational Inference (Wainwright and Jordan, 2008) [return]
  9. Group Equivariant Convolutional Networks (Cohen and Welling, 2016) [return]
  10. Permutation Equivariant Models for Compositional Generalization in Language (Gordon, Lopez-Paz, Baroni and Bouchacourt, 2020) [return]
  11. Structured Attention Networks (Kim, Denton, Hoang and Rush, 2017) [return]
  12. SparseMAP: Differentiable sparse structured inference (Niculae, Martins, Blondel and Cardie, 2018) [return]
  13. Learning Latent Trees with Stochastic Perturbations and Differentiable Dynamic Programming (Corro and Titov, 2019) [return]
  14. Differentiable Dynamic Programming for Structured Prediction and Attention (Mensch and Blondel, 2018) [return]
  15. On Differentiating Parameterized Argmin and Argmax Problems with Application to Bi-level Optimization (Gould, Fernando, Cherian, Anderson, Cruz and Guo, 2016) [return]
  16. OptNet: Differentiable Optimization as a Layer in Neural Networks (Amos and Kolter, 2017) [return]
  17. Differentiable Convex Optimization Layers (Agrawal, Amos, Barratt, Boyd, Diamond and Kolter, 2019) [return]
  18. SATNet: Bridging deep learning and logical reasoning using a differentiable satisfiability solver (Wang, Donti, Wilder and Kolter, 2019) [return]
  19. Neural Word Embedding as Implicit Matrix Factorization (Levy and Goldberg, 2014) [return]
  20. A Formal Hierarchy of RNN Architectures (Merrill, Weiss, Goldberg, Schwartz, Smith and Yahav, 2020) [return]
  21. Theoretical Limitations of Self-Attention in Neural Sequence Models (Hahn, 2020) [return]
  22. On the Ability and Limitations of Transformers to Recognize Formal Languages (Bhattamishra, Ahuja and Goyal, 2020) [return]
  23. Self-Attention Networks Can Process Bounded Hierarchical Languages (Yao, Peng, Papadimitriou and Narasimhan, 2021) [return]
  24. Limitations of Autoregressive Models and Their Alternatives (Lin, Jaech, Li, Gormley and Eisner, 2021) [return]
  25. On automatic differentiation (Griewank, 1989) [return]
  26. Complexity of exact gradient computation algorithms for recurrent neural networks (Williams, 1989) [return]
  27. An $\mathcal O(n^3)$ learning algorithm for fully recurrent neural networks (Schmidhuber, 1991) [return]
  28. A fixed size storage $\mathcal O(n^3)$ time complexity learning algorithm for fully recurrent continually running networks (Schmidhuber, 1992) [return]
  29. Higher-order Derivatives of Weighted Finite-state Machines (Zmigrod, Vieira and Cotterell, 2021) [return]
  30. GPU Kernels for Block-Sparse Weights (Gray, Radford and Kingma, 2017) [return]
  31. Learning sparse neural networks through $L_0$ regularization (Louizos, Welling and Kingma, 2018) [return]
  32. Efficient Transformers: A Survey (Tay, Dehghani Bahri and Metzler, 2020) [return]
  33. Binarized Neural Networks: Training Neural Networks with Weights and Activations Constrained to +1 or −1 (Courbariaux, Hubara, Soudry, El-Yaniv and Bengio, 2016) [return]
  34. Compressing Neural Machine Translation Models with 4-bit Precision (Aji and Heafield, 2020) [return]
  35. Extragradient with player sampling for faster Nash equilibrium finding (Jelassi, Enrich, Scieur, Mensch and Bruna, 2020) [return]
  36. Competitive Gradient Descent (Schäfer and Anandkumar, 2019) [return]
  37. Sparse Text Generation (Martins, Marinho and Martins, 2020) [return]
  38. Sparse Sequence-to-Sequence Models (Peters, Niculae and Martins, 2019) [return]
  39. If Beam Search is the Answer, What was the Question? (Meister, Cotterell and Vieira, 2020) [return]
  40. A Continuous Relaxation of Beam Search for End-to-End Training of Neural Sequence Models (Goyal, Neubig, Dyer and Berg-Kirkpatrick, 2018) [return]
  41. Bridging the Gap between Training and Inference for Neural Machine Translation (Zhang, Feng, Meng, You and Liu, 2019) [return]
  42. Diverse Beam Search for Improved Description of Complex Scenes (Vijayakumar, Cogswell, Selvaraju, Sun, Lee, Crandall and Batra, 2018) [return]
  43. Determinantal Beam Search (Meister, Forster and Cotterell, 2021) [return]
  44. Learning Neural Templates for Text Generation (Wiseman, Shieber and Rush) [return]
  45. Template Filling with Generative Transformers (Du, Rush and Cardie, 2021) [return]
  46. Sequence Modeling with Unconstrained Generation Order (Emelianenko, Voita and Serdyukov, 2019) [return]