# Research Topics: Deep Learning Beyond the Gradient

## Table of Contents

(Download as pdf)

**Keywords:** Automatic differentiation, Hessian, Fisher, K-FAC, second-order optimization, Newton's method

**TL;DR:** Contemporary deep learning (DL) algorithms rely on the gradient. During my PhD, I developed tools to efficiently compute quantities beyond the gradient—higher-order information—like gradient statistics over a mini-batch and Hessian approximations. This information enables a richer view onto the loss landscape, quantifying its noise (stochasticity) and curvature (geometry). During my Postdoc, I want to **explore the potential of higher-order information for building more powerful DL methods** and further **improve automatic differentiation**.

**Interested in similar topics? Let's chat!**

No time to read on? Don't worry. I'll try to find and reach out to you ;)

(details below, feel free to skip)

## Improving Automatic Differentiation Frameworks

Next-generation ML frameworks should be able to seamlessly and efficiently compute not only the gradient, but also higher-order information. Currently, their computation often requires workarounds that are inefficient or complicated to implement, reducing the availability to users. I want to **identify shortcomings in existing AD libraries to improve their design and performance** and make higher-order information accessible to the community. Ideas:

- Make frameworks more aware of (i) linear algebra structure and (ii) hardware to automate many numerical tricks and optimizations (for instance optimizing contraction schedules, partial operand access/slicing, \(\dots\)).
- Develop representations for higher-order derivatives that simplify graph analysis and optimization.
- Explore the implications of new concepts, for instance distinguishing between co- and contra-variant indices of tensors rather than treating them as multi-dimensional arrays [1].

## Establishing Second-order Methods

**Note:** 'Second-order method' might refer to optimization, or more broadly any method motivated by a quadratic Taylor approximation (for instance for model compression [2] or Laplace approximations [3]).

DL is dominated by first-order (gradient-based) methods. Exploring methods that rely on other information has been prohibitively expensive and complicated until recently due to software constraints. I want to **investigate the potential of higher-order information such as curvature and noise for DL applications**. Ideas:

- Identify challenges for second-order optimization methods in the DL regime and work on fixes.
- What do "Newton's method" and "air travel" have in common? Both are very fast, but their worst-case is bad! (slide 97) Is noise a problem for their stability in the mini-batch setting? Can we develop techniques to improve stability, for instance removing the need for damping (fractional NGD [4]), applying momentum to Newton steps (determinantal averaging [5]), or designing other safeguard heuristics?
- Is there theory on the impact of noise in the convex setting?
- Is their implicit bias a problem for generalization?
- Reducing computational cost, for instance with 'mixed training' (split NN parameters into two groups, train one group with a first-order method, the other with a second-order method).

- Identify new tasks where higher-order information is useful (or even required as first-order methods fail).
- Hyperparameter adaptation with higher-order information

## Empirical Studies and Novel Approximations of Higher-order Information

Empirical investigations of DL quantities have observed pronounced patterns, for instance in the Hessian's spectrum [6]–[8]. I want to **investigate and understand the origin of structure in higher-order information**. This enables the development of efficient representations that specifically tackle the approximation of relevant terms. Ideas:

- Using tiny subspace, class-structure, and layer structure for new approximations. Numerous empirical works observed the Hessian's eigenvalue spectrum to cluster into groups [6]–[8]. The sizes of those groups are affected by the DNN's output dimension \(C\) (number of classes). [9] observed that after a few steps of training, the gradient resides mostly in the first group—the Hessian's top-\(C\) eigenspace. Keeping track of interactions between gradient and Hessian can therefore be sped up by projection onto that tiny subspace.
- [10] attributed these groups to terms in a hierarchical decomposition of the Hessian. The term responsible for the top-\(C\) eigenvalues can be computed relatively efficiently. The tiny subspace seems to not change much during training. This allows for constructing light-weight, high quality Hessian approximations that can be recycled in later iterations (momentum).

- Understand why using gradients to approximate curvature seems to work in DL. For instance: The empirical Fisher (EF) is a popular approximation of the Fisher/Hessian built from gradients. [11] shows that, in general, it can be dangerous to use the EF instead of the Fisher for optimization. Empirically, however, the EF performs well in DL applications [2]. Also, the EF and Fisher/GGN look quite similar for a DNN (visually). Maybe both matrices share important properties in the DL setting, despite the concerns of [11]?
- Investigate evolution during training for phase-dependent algorithms. By monitoring such quantities during training, one can identify different regimes and design specialized methods for each [12].

## References

*Advances in neural information processing systems (neurips)*, 2019.