Research Interests: Learning Algorithms Beyond the Gradient
This document collects my current research interests and related papers.
Interested in similar topics? Let's chat!
- Natural gradient descent and second-order optimization for LLMs [1]–[3], distributed training [4]
- Automatic differentiation [5], specifically computing higher-order derivatives for scientific ML [6], [7]
- Curvature/covariance matrix estimation (Hessian, Fisher, …) [8]–[10] with structured matrices
- Curvature information outside optimization [11] (e.g. training data attribution [12], [13])
- Symbolic computing, compute graph surgery, ML compilers [14], [15]
- Randomized linear algebra [16], [17]
- Neural network parameter space symmetries [18], [19]
[1]
K. Jordan et al., “Muon: An optimizer for hidden layers in neural networks,” 2024.
[2]
N. Vyas et al., “SOAP: Improving and stabilizing Shampoo using Adam,” arXiv, 2025.
[3]
V. Gupta, T. Koren, and Y. Singer, “Shampoo: Preconditioned stochastic tensor optimization,” 2018.
[4]
H.-J. M. Shi et al., “A distributed data-parallel PyTorch implementation of the Distributed Shampoo optimizer for training neural networks at-scale,” 2023.
[5]
A. Griewank and A. Walther, Evaluating derivatives: principles and techniques of algorithmic differentiation. SIAM, 2008.
[6]
M. J. Johnson, J. Bettencourt, D. Maclaurin, and D. Duvenaud, “Taylor-mode higher-order automatic differentiation,” 2021.
[7]
R. Li et al., “Forward Laplacian: A new computational framework for neural network-based variational Monte Carlo,” 2023.
[8]
J. Martens and R. Grosse, “Optimizing neural networks with Kronecker-factored approximate curvature,” 2015.
[9]
T. George, C. Laurent, X. Bouthillier, N. Ballas, and P. Vincent, “Fast approximate natural gradient descent in a Kronecker-factored eigenbasis,” Advances in Neural Information Processing Systems (NeurIPS), 2018.
[10]
W. Lin, S. C. Lowe, F. Dangel, R. Eschenhagen, Z. Xu, and R. B. Grosse, “Understanding and improving Shampoo and SOAP via Kullback-Leibler minimization,” arXiv, 2025.
[11]
F. Dangel, R. Eschenhagen, W. Ormaniec, A. Fernandez, L. Tatzel, and A. Kristiadi, “Position: Curvature matrices should be democratized via linear operators,” arXiv, 2025.
[12]
R. Grosse et al., “Studying large language model generalization with influence functions,” 2023.
[13]
P. W. Koh and P. Liang, “Understanding black-box predictions via influence functions,” 2017.
[14]
V. Roulet and A. Agarwala, “Per-example gradients: A new frontier for understanding and improving optimizers,” arXiv, 2025.
[15]
F. Dangel, T. Siebert, M. Zeinhofer, and A. Walther, “Collapsing Taylor mode automatic differentiation,” 2025.
[16]
A. Guzmán-Cordero, F. Dangel, G. Goldshlager, and M. Zeinhofer, “Improving energy natural gradient descent through Woodbury, momentum, and randomization,” 2025.
[17]
A. Fernandez, F. Dangel, P. Hennig, and F. Schneider, “Sketching low-rank plus diagonal matrices,” arXiv, 2025.
[18]
M. F. da Silva, F. Dangel, and S. Oore, “Hide & seek: Transformer symmetries obscure sharpness & Riemannian geometry finds it,” 2025.
[19]
D. Kunin, J. Sagastuy-Brena, S. Ganguli, D. L. Yamins, and H. Tanaka, “Neural mechanics: Symmetry and broken conservation laws in deep learning dynamics,” 2021.
Author: Felix Dangel
Created: 2025-11-04 Tue 14:28