Foundations of Deep Learning Reading Group
Organizers: Michael Kapralov and Ola Svensson.
Recent advances in machine learning have led to breakthroughs in natural language processing, vision and more. Modern deep neural networks, while producing impressive results in practice, present many challenges: they use tremendous computational power, often produce results that are hard to interpret, and raise privacy concerns. The goal of our reading group is to cover some of the recent works that seek to establish a theoretical foundation for deep neural networks that will address these challenges.
Time and place: Wednesdays from 2.15pm to 4pm in BC 04.
Mailing list and calendar: There is a mailing list where the announcements of upcoming lectures will appear. To subscribe to this list, send an email (can be empty) to dl-foundations-subscribe@listes.epfl.ch.
List of topics:
- Transformers: the transformer architecture, as well as (some of) the recent works on subquadratic attention mechanisms (xformers, state space models, Hyena, etc.).
- Which functions can transformers learn: classes of functions that transformers can provably learn in-context.
- Graph neural networks: expressivity of graph neural networks (connections to the Weisfeiler-Lehman test), and information bottlenecks in graph neural networks.
- Reinforcement learning: algorithmic foundations of reinforcement learning, including Markov Decision Processes and Monte Carlo Tree Search.
We will partially follow the excellent course by Boaz Barak on foundations of deep learning.
Tentative schedule:
- Wed March 6, 2.15pm in BC04, Basics of the Transformer architecture
We will talk about the basics of neural networks, introduce the transformer architecture and discuss some approaches to reducing the quadratic runtime of attention mechanisms (a minimal attention sketch follows the reading list below).
Reading material:
The original transformer paper.
A great blog post by Peter Bloem on a simplified version of the transformer architecture.
Additional reading: transformers for vision.
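To make the quadratic cost concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention as in the original transformer paper; the function names and toy dimensions are ours, and real implementations add batching, multiple heads, masking and learned projections.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Single-head scaled dot-product attention.

    Q, K, V have shape (n, d); the n x n score matrix is the
    source of the quadratic cost discussed in this session.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)          # (n, n) -- quadratic in n
    return softmax(scores, axis=-1) @ V    # (n, d)

# Toy usage: n = 4 tokens, d = 8 dimensions.
rng = np.random.default_rng(0)
n, d = 4, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(attention(Q, K, V).shape)  # (4, 8)
```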
- Wed March 13, 2.15pm in BC04, More on the transformer architecture and subquadratic attention mechanisms
We will talk about some recent approaches to reducing quadratic runtime of attention mechanisms (xformers).
Reading material:
Some xformers:
KDEFormer: approximating softmax via kernel density estimation.
Nyströmformer: approximating the attention matrix using the Nyström method (a toy version is sketched after this list).
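As a rough illustration of the Nyström idea, the sketch below approximates the softmax attention matrix through a small set of landmark rows and columns. It uses segment-mean landmarks and an exact pseudoinverse, whereas the Nyströmformer paper uses an iterative pseudoinverse approximation and further refinements, so treat this only as a toy version.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def nystrom_attention(Q, K, V, m=8):
    """Nyström-style approximation of softmax attention.

    Landmarks are taken as segment means of Q and K; the full
    n x n attention matrix is never formed, giving cost
    O(n m d + m^3) instead of O(n^2 d).
    """
    n, d = Q.shape
    # Segment-mean landmarks: average Q and K over m contiguous chunks.
    Q_l = Q[: (n // m) * m].reshape(m, n // m, d).mean(axis=1)   # (m, d)
    K_l = K[: (n // m) * m].reshape(m, n // m, d).mean(axis=1)   # (m, d)

    F = softmax(Q @ K_l.T / np.sqrt(d))      # (n, m)
    A = softmax(Q_l @ K_l.T / np.sqrt(d))    # (m, m)
    B = softmax(Q_l @ K.T / np.sqrt(d))      # (m, n)

    # softmax(Q K^T / sqrt(d)) V  is approximated by  F A^+ (B V).
    return F @ np.linalg.pinv(A) @ (B @ V)

rng = np.random.default_rng(0)
n, d = 64, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(nystrom_attention(Q, K, V).shape)  # (64, 16)
```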
- Wed March 20, 2.15pm in BC04, Subquadratic attention mechanisms through sketching
We will talk about Performer and PolySketchFormer, two approaches that approximate the attention matrix by a low-rank matrix (a random-feature sketch follows the references below).
References:
Performer: applying random features to approximate the softmax attention matrix (see also blogpost).
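Below is a minimal sketch of the random-feature idea behind Performer: the softmax kernel is approximated by an inner product of positive random features, so attention can be computed in linear time by reassociating the matrix products. The paper's orthogonal features and numerical stabilization are omitted here, and the function name and feature count are illustrative.

```python
import numpy as np

def performer_attention(Q, K, V, m=256, seed=0):
    """Linear-time attention via positive random features (Performer-style).

    exp(q.k) is approximated by phi(q).phi(k) with
    phi(x) = exp(w.x - |x|^2 / 2) / sqrt(m),  w ~ N(0, I).
    """
    n, d = Q.shape
    Q, K = Q / d ** 0.25, K / d ** 0.25      # fold in the 1/sqrt(d) scaling
    W = np.random.default_rng(seed).standard_normal((d, m))

    def phi(X):
        return np.exp(X @ W - 0.5 * (X ** 2).sum(-1, keepdims=True)) / np.sqrt(m)

    Qp, Kp = phi(Q), phi(K)                  # (n, m) feature maps
    # Reassociation: (Qp Kp^T) V = Qp (Kp^T V), so no n x n matrix is formed.
    num = Qp @ (Kp.T @ V)                    # (n, d), cost O(n m d)
    den = Qp @ Kp.sum(axis=0)                # (n,)  row normalization
    return num / den[:, None]

rng = np.random.default_rng(1)
n, d = 64, 16
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
print(performer_attention(Q, K, V).shape)  # (64, 16)
```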
- Wed March 27th, 2.15pm in BC04, State space models
We will talk about state space models, which aim to match the impressive performance of transformers while replacing attention with subquadratic recurrent/convolutional mechanisms (a minimal recurrence sketch follows the reading list below).
Reading material:
Original state space paper: modeling long sequences with state space models
Hyena hierarchy: replacing the attention matrix with a sequence of diagonal multiplies in time and Fourier domain.
A nice ablation study on SSMs: significant simplifications to SSM can be made without sacrificing performance.
Mamba: the most recent state space model
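For orientation, here is a minimal sketch of the discrete linear recurrence that underlies these models. Actual SSM layers parametrize the matrices carefully (HiPPO initialization, learned discretization, input-dependent parameters in Mamba) and evaluate the recurrence as a convolution or parallel scan; the sequential loop here is only meant to show the linear dependence on sequence length.

```python
import numpy as np

def ssm_scan(A, B, C, u):
    """Run a discrete linear state space model over a length-n input.

    x_k = A x_{k-1} + B u_k,   y_k = C x_k.
    """
    d_state = A.shape[0]
    x = np.zeros(d_state)
    ys = []
    for u_k in u:                    # one step per token: linear in n
        x = A @ x + B * u_k          # state update
        ys.append(C @ x)             # readout
    return np.array(ys)

rng = np.random.default_rng(0)
d_state, n = 4, 32
A = 0.9 * np.eye(d_state)            # toy stable transition matrix
B = rng.standard_normal(d_state)
C = rng.standard_normal(d_state)
u = rng.standard_normal(n)           # scalar input channel
print(ssm_scan(A, B, C, u).shape)    # (32,)
```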
- Fri April 19th, 1.00pm in INJ114, Which functions are representable/learnable by a transformer, or a state space model?
We will discuss empirical studies of which functions transformers learn well and which they do not, as well as some progress towards a theoretical explanation of these phenomena.
Reading material:
An empirical study of length generalization in transformers, with some nice formalism to explain the findings (the RASP-L language).
In-context learning: an empirical study of in-context learnability of simple function classes such as linear functions and shallow neural nets (the prompt setup is sketched after this list). Talk video.
Linear regression can be learned in-context by a transformer: some analysis plus empirical validation.
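To fix ideas, the sketch below generates one prompt in the style of the in-context linear regression setup: a hidden weight vector labels a handful of example points, and the model must predict the label of a fresh query from the context alone, with no weight updates. The least-squares baseline shown is the comparison point used in these papers; the exact prompt encoding and the dimensions here are illustrative.

```python
import numpy as np

def linear_regression_prompt(d=8, k=16, seed=0):
    """Build one in-context linear-regression prompt (toy setup).

    A hidden weight vector w defines f(x) = w.x; the prompt consists of
    k labelled examples plus one query point whose label must be
    predicted purely from the context.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    X = rng.standard_normal((k, d))          # in-context examples
    y = X @ w                                # their labels
    x_query = rng.standard_normal(d)
    return X, y, x_query, w @ x_query        # context, labels, query, target

X, y, x_query, target = linear_regression_prompt()
# Least-squares prediction from the context: the natural baseline against
# which a trained transformer is measured.
w_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(float(w_hat @ x_query), float(target))   # nearly identical (noiseless case)
```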
- Fri May 3rd, 1.15pm in INJ114, Trained transformers can in-context learn linear regression
We will talk about this paper, which proves actual learnability (as opposed to representability) guarantees.
- Wed May 8th, 2.15pm in BC04, Graph neural networks, part i: guest lecture by Ameya Velingker
Graphs are ubiquitous in machine learning, as objects and their relations can be modeled conveniently as sets of nodes and edges between them. For instance, a number of domains (e.g., molecules and proteins, quantum chemistry, road networks, social networks, etc.) have underlying datasets that possess a natural graph structure. While much of modern machine learning has focused on tasks that involve sequence or grid data (e.g., images), a lot of real-world data cannot be represented effectively as a sequence or grid. The development of tools for machine learning on graph-structured data allows one to develop neural networks that are applicable to a broader range of domains.
The area of machine learning on graph-structured data has taken off over the last several years. We survey this area over multiple parts, giving an overview of fundamental results, research directions, and challenges, with a particular emphasis on the potential for the use of tools from theoretical computer science to address key challenges in the area. We begin with the basics of popular graph learning architectures such as Graph Neural Networks (GNNs) and Graph Transformers and move on to fundamental questions about expressivity (e.g., Weisfeiler-Lehman isomorphism test) and generalizability (e.g., VC dimension bounds) of model architectures. We also hope to cover practical challenges, such as over-squashing, over-smoothing, scalability, etc., faced by these architectures and highlight recent research on alleviating such issues.
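As a concrete reference point for the expressivity discussion, here is a minimal NumPy sketch of one GCN-style message-passing layer: each node aggregates its own and its neighbours' features under a symmetric normalization and applies a shared linear map, which is the local aggregation whose distinguishing power is compared to the Weisfeiler-Lehman test. This is a textbook simplification, not the architectures covered in the lectures.

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN-style layer: H' = relu(D^-1/2 (A + I) D^-1/2 H W).

    A is the adjacency matrix (n x n), H the node features (n x f),
    W a shared weight matrix (f x f').
    """
    n = A.shape[0]
    A_hat = A + np.eye(n)                        # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))     # symmetric normalization
    return np.maximum(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W, 0.0)

# Toy graph: a 4-cycle with random 5-dimensional node features.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 5))
W = rng.standard_normal((5, 3))
print(gcn_layer(A, H, W).shape)   # (4, 3)
```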
- Wed May 15th, 2.15pm in BC04, Graph neural networks, part ii: guest lecture by Ameya Velingker