# The algebra of second order methods in neural networks

October 29, 2017

This note gives the derivations for the inverse of 2 different but related 2nd order matrices: the Fisher Information Matrix, and the Gauss-Newton approximation of the Hessian. In particular we highlight 2 centering properties that follow from the local structure of those matrices:

- we should always use a centered update for weight matrices, even if it does not follow the gradient direction (see section 4.2↓)
- we should normalize using the (centered) covariance matrix of the activation of each layer (see section 3.2↓)

Along the way, we describe the derivation of an approximate method using the properties of the Kronecker product known as KFAC [2] with corresponding parameter updates, and we give a motivation for the centering trick used in Natural Neural Networks [1].

# 1 Notations and problem statement

We denote by the output of a fully connected neural network parametrized by its weight matrices and bias vectors grouped in a vector of paramters . Let us denote by a loss function between the value given by the model and the true value . is the empirical risk on a train set of examples: where we denoted to simplify notations.

Suppose a second order update . can be the Hessian matrix or an approximation given by Gauss Newton. can also be the Fisher Information Matrix and in this case the update is called the natural gradient. By writing the expression for Gauss-Newton and Fisher, we observe that they share a similar structure:

is the jacobian matrix of the output of the network, with respect to the parameters . It is of size . For a small change it is a first order measure of the change in the value of or more precisely . The expression for the Fisher Information Matrix is given by [3]. Without loss of generality we denote both matrices by:

# 2 Local expression for the matrix

The matrix has size . For a typical neural network with several millions of parameters it is untractable to store and to invert. We usually approximate it as block diagonal, where each block is a square matrix of the size of the number of scalar parameter values for a layer. With this structure, we can invert each block separately and apply the update layer by layer: . Let us now give an exact expression for this smaller matrix and its inverse . We call it local it the sense that it is local to a layer.

## 2.1 Stacking the parameters

The computation made by a layer is given by . The parameters for this layer are a matrix and a vector . But in order to write a concise expression for we need to stack them into a vector so that the gradient is a vector and writing makes sense. To this end, we use the operator that stacks the column of a matrix into a vector, i.e. for a matrix:

Our full vector of parameters becomes:

## 2.2 Expressions for the jacobians

We now focus on the block for layer . We require an expression for . By the chain rule we separate it into a back propagated contribution and a local contribution:

To obtain an exact expression for we will use once again with the property that where is the Kronecker product:

In the second line we used the fact that is a vector and thus . We also introduced the identity matrix of the same size of . From eq \ref{eq:flattened_linear} we can directly read the jacobians:

Now using :

## 2.3 Expression for the block

Getting back to the block we get a simple expression:

In eq \ref{eq:befsim} we added as it does not change anything. Between eq \ref{eq:befsim} and \ref{eq:aftsim} we used the fact that when have corresponding sizes (i.e. the products and make sense).

## 2.4 Discussion

We obtained an *exact* expression for the block corresponding to layer
:

It is an expectation of Kronecker products. Note that we can not swap the expectation and the Kronecker products, and thus while the expression in eq \ref{eq:exact} is exact, the one used in KFAC [2] is an approximation.

In eq \ref{eq:exact} we denoted by the contribution that is backpropagated, and by a contribution that is local to the parameters of the layer.

# 3 Inverting the matrix

## 3.1 KFAC drill-down

Exactly inverting this matrix can still be untractable for typical neural networks. An approximation that is easier to manipulate is proposed in KFAC [2]. The key property that we are after here is that for 2 invertible matrices and we have that . It becomes:

The residual resembles a covariance between both terms:

The conditions under which it is negligible have not been extensively studied, or at least published to the best of our knowledge. We can however remark that if one part is close to then the expected product will be small. This is achieved for instance if is small for all the data generating distribution (recall that and depend on ). To put it into words if the value of does not vary much for all training examples. By symmetry we can make a similar argument for .

## 3.2 KFAC inversion

We now have a factorized approximate expression for :

Note that while the derivations proposed in KFAC use a single vector for all parameters of the layer , we explicitely separated the weight matrix and the bias in section 2.2↑ which gives a slightly different expression. Thus the matrix is separated into 2 blocks: 2 blocks on the diagonal that correspond to the weight matrix (block ) and the bias (block ), and 2 cross-terms that explicit their interactions.

We will see that separating the bias gives a nicer interpretation with a covariance matrix (as opposed to non-centered statistics).

We can now use the property . can not be be further simplified, so the next part is to obtain an expression for :

We can use the formula for inverting a block matrix (see Wikipedia:Block Matrix). We denote by and we get:

Note that is the covariance of : . It is centered (we substract ) which follows from the block matrix inversion formula, which in turns follows from the fact that we separated the bias. This motivates the use of centered statistics in second order inspired algorithms.

# 4 Writing the update

## 4.1 Derivation

Now that we have an expression for we can write the product required to make an update . In section 2.2↑ we wrote an expression for the jacobians . By a similar analysis we can write the gradients :

Using the same expressions as in 2.2↑ we can simplify :

Multiplying together with we get the product :

In the first line we can read the update for (in fact its vectorized version ), and the second line is the update for .

## 4.2 Updating the weight matrix

The new update for is given by:

We some algebraic manipulations we get back to the expression for the unflattened matrix:

This is to compare with the usual gradient descent update given by:

We can notice 2 additions:

- the update is rescaled and rotated using the 2 matrices and
- the expectation is centered by substracting

In addition to the derivation proposed in KFAC, by expliciting the bias we obtained 2 different centerings:

- the covariance matrix
- the expectation is centered by substracting

## 4.3 Updating the bias vector

The new update for is given by:

This is to compare with the usual gradient descent update given by:

Once again we notice that the update is scaled and rotated using but there is also this strange scalar scaling for which we are not able to give an interpretation. However in practice we noted that we did not have any performance gain compared to using only .

# 5 Conclusion

We gave explicit derivation of the second order updates used in Gauss-Newton and in Natural gradient. By explicitly separating the weight matrices and the bias we obtained a nice centering term in both the covariance matrix used to rotate the update , and in the expectation used to compute the gradient .

It is well-known that centering things is often useful. It is sometimes referred as the centering trick, or mean only batch norm. Another efficient technique called natural neural networks [1] building on the structure of the FIM mentions the trick without giving it much justification. To the best that we know this justification based on the structure of the second order matrices has not yet been contributed. We hope that this blog note can enlighten deep learning practitioners who are not very familliar with second order methods, in order to invent new approximate algorithms with more efficient updates.

# References

[1] . Natural neural networks. *Advances in Neural Information Processing Systems*:2071—2079, 2015.

[2] . Optimizing neural networks with kronecker-factored approximate curvature. *International Conference on Machine Learning*:2408—2417, 2015.

[3] . Revisiting natural gradient for deep networks. *arXiv preprint arXiv:1301.3584*, 2013.