What?
This article has two main findings:
- Given two local minima in the loss surface of a neural net, it is possible to find a simple non-linear path between them that maintains similarly low loss values along its whole length
- Given a single local minimum, it is possible to find further minima in its neighborhood. Here the authors propose Fast Geometric Ensembling (FGE), which samples new local minima around the first one (a rough sketch of the idea follows below this list).
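To make the second point concrete, here is a minimal sketch (my own, not the authors' code) of how nearby minima could be collected with a cyclical learning rate in the spirit of FGE. The model, data loader, loss function, and the schedule constants (`lr_max`, `lr_min`, `cycle_len`) are all assumptions.

```python
import copy
import torch

def fge_snapshots(model, loss_fn, loader, n_cycles=4, cycle_len=200,
                  lr_max=5e-2, lr_min=5e-4):
    """Collect weight snapshots around an already-trained solution.

    Saw-tooth variant of a cyclical learning rate: within each cycle the rate
    drops linearly from lr_max to lr_min, and a copy of the weights is saved
    at the low point, giving an ensemble of nearby solutions.
    """
    opt = torch.optim.SGD(model.parameters(), lr=lr_max, momentum=0.9)
    snapshots = []
    data_iter = iter(loader)
    for step in range(1, n_cycles * cycle_len + 1):
        phase = ((step - 1) % cycle_len) / cycle_len   # 0 -> 1 within a cycle
        lr = lr_max - (lr_max - lr_min) * phase
        for group in opt.param_groups:
            group["lr"] = lr

        try:
            x, y = next(data_iter)
        except StopIteration:              # restart the loader when exhausted
            data_iter = iter(loader)
            x, y = next(data_iter)

        opt.zero_grad()
        loss_fn(model(x), y).backward()
        opt.step()

        if step % cycle_len == 0:          # low point of the cycle: snapshot
            snapshots.append(copy.deepcopy(model.state_dict()))
    return snapshots
```

At test time the snapshots would be averaged in prediction space (e.g. by averaging softmax outputs), which is what makes this an ensemble rather than a single model.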
Why?
Both methods make it possible to create ensembles of networks without using too much compute. This can be very useful for Bayesian MCMC-style inference. Other ideas are also suggested in the conclusion of the article, among which:
- derive new proposal distributions for variational inference, exploiting the flatness of these pathways
- accelerate the convergence, stability and accuracy of optimization procedures like SGD, by helping us understand the trajectories along which the optimizer moves
- find curves with particularly desirable properties, such as diversity of networks.
How?
Given two local minima, the authors construct two types of low-loss paths between them: one made up of two line segments, and one a quadratic Bézier curve. Both have a single free parameter $\theta$ (a point with the dimensionality of the parameter space) that controls the shape, so each path lies in a 2D plane of parameter space spanned by the two minima and $\theta$. The formula for the path made up of two line segments is
\[\phi_\theta(t)=\left\{\begin{array}{ll}2\left(t\theta+(0.5-t)\hat{w}_1\right),&0\leq t\leq0.5\\2\left((t-0.5)\hat{w}_2+(1-t)\theta\right),&0.5\leq t\leq1.\end{array}\right.\]The expected loss along the curve is approximated by sampling $t \sim U(0,1)$ and, for each sample, taking a gradient step on $\theta$ that lowers the loss at $\phi_\theta(t)$.
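As a concrete illustration of this training loop, here is a minimal numpy sketch for the two-segment path on a toy quadratic loss (the loss, step size, and endpoints are my own assumptions; the chain-rule factors $2t$ and $2(1-t)$ follow directly from the formula above). For reference, the quadratic Bézier alternative is the standard parametrization $\phi_\theta(t)=(1-t)^2\hat{w}_1+2t(1-t)\theta+t^2\hat{w}_2$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic loss 0.5 * w^T H w standing in for a network's training loss.
A = rng.standard_normal((10, 10))
H = A @ A.T + np.eye(10)                      # positive definite Hessian

def loss_grad(w):
    return H @ w                              # gradient of the toy loss

def phi(theta, t, w1, w2):
    """Two-segment path from the formula above."""
    if t <= 0.5:
        return 2.0 * (t * theta + (0.5 - t) * w1)
    return 2.0 * ((t - 0.5) * w2 + (1.0 - t) * theta)

def dphi_dtheta(t):
    """Scalar chain-rule factor of the piecewise formula w.r.t. theta."""
    return 2.0 * t if t <= 0.5 else 2.0 * (1.0 - t)

# Endpoints standing in for the two pre-trained minima \hat{w}_1, \hat{w}_2.
w1, w2 = rng.standard_normal(10), rng.standard_normal(10)
theta = 0.5 * (w1 + w2)                       # start the bend at the midpoint
lr = 1e-2

for _ in range(2000):
    t = rng.uniform(0.0, 1.0)                 # sample t ~ U(0, 1)
    w = phi(theta, t, w1, w2)                 # current point on the path
    theta -= lr * dphi_dtheta(t) * loss_grad(w)   # gradient step on theta
```

In the real setting $\hat{w}_1$ and $\hat{w}_2$ are the weights of two independently trained networks, and `loss_grad` is replaced by a minibatch gradient of the training loss evaluated at $\phi_\theta(t)$ (e.g. via autograd).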
And?
I am a bit in doubt why they sample $t \sim U(0,1)$ rather than sampling such that any point on the path is equally likely (i.e. uniform in arc length); the latter seems pretty straightforward to me.
Another idea I got from reading the paper would be to use identifiable copies (whenever there are two dense layers) as the second local minimum, rather than doing Fast Geometric Ensembling (which, as far as I can understand, samples solutions very close to the original one).