What is the empirical Fisher ?
November 9, 2018
Some recent papers mention that they use the inverse of the "empirical Fisher" as a preconditioner. The main reason is its simplicity of use since it only requires gradients of the loss with respect to the parameters for each individual example. These are the same gradients as the ones we use to estimate our expected gradient when using SGD, as opposed to the true Fisher used in natural gradient, where the gradients that we need are gradients sampled from the distribution represented by our neural network.
The update using the "empirical Fisher" is:
Where is often estimated using its minibatch estimate, and is the (uncentered) covariance of the gradients, also estimated using a minibatch, or a running average. is the learning rate, and is a Tikhonov damping parameter.
1 What problem are we solving when using this update?
Claim: This update is solution to the following problem, up to a second order approximation:
Where we defined , and is a predefined scalar constant.
Proof: We start by writing the first order Taylor series expansion of :
Where hides the higher order terms. It is a function such that , or to put it into words, it will be negligible compared to the first order term as long as is not too big.
By replacing in the constraint we obtain:
In the second line we have hidden the cross product in .
We now remark that we can rewrite:
And so our minimization problem becomes:
Which can be solved e.g. using Lagrange multipliers, and we obtain the update:
Where is a scalar that we usually define as being the (constant) learning rate, but to be more precise it should be set so that the constraint is enforced. The role of is to make sure that regardless of the spectrum of , the update will not get too big, and make our second order approximation wrong.
What does it mean to be solving this minimization problem?
First, it means that we measure progress in the space of our loss function. It has the desirable effect of making this update invariant by reparametrization of the network, as long as is kept small.
Second, it means that we will encourage all examples to have their loss reduced by a similar amount, on average . Is this something desirable or not ? I don’t know but I am open to your suggestions!