## Neural Networks and Deep Learning, Chapter 3

### No introduction, just straight into it

In last week's results section, I started by describing how I'd spent most of my time tracing a bug (if one can call it that – my code was fine!). The neural network classified handwritten digits from the MNIST dataset with about 95% accuracy when the ink pixels had values near 1 and the background near 0, but got stuck in various apparent local minima, at no more than about 60% accuracy, when I inverted the pixels so that the background pixels had value 1 and the ink pixels were near 0. When I wrote it up, I said that I had no idea why the network should struggle in this way.

When prodded on Facebook about it, I thought of a possible cause for this unusual behaviour:

Maybe (handwaving here that I haven't checked rigorously) if you have a lot of 1's in the input layer, then the weighted inputs to the next layer (_before_ they're sent through the sigmoid function) will tend to be a long way from zero.

Now, the sigmoid function's derivative is greatest when the input is zero and small when the input is a long way from zero, and the cost gradient contains one or more \( \sigma' \) terms, so perhaps there's simply no gradient to descend out there.
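A quick numerical check of that claim (a standalone sketch, not code from the book):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1 - s)

# The derivative peaks at z = 0 and collapses far away from it.
print(sigmoid_prime(0.0))   # 0.25, the maximum
print(sigmoid_prime(10.0))  # roughly 4.5e-5: effectively no gradient
```

So a neuron whose weighted input sits at 10 rather than 0 contributes a gradient term about four orders of magnitude smaller.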

It was somewhat gratifying to read essentially this argument in chapter 3, along with some proposed solutions, which I'll sketch shortly. (The problem, to adopt the jargon, is of neurons "saturating".) Curiously, despite seeming directly applicable to my inverted-pixel problem, the proposed solutions didn't have much of a useful effect, and instead I stumbled across the fix totally by accident – the digits started to be classified at about the expected rate (94-95% on my most recent attempt) when I reduced the learning rate \( \eta \) by a factor of 10 (specifically, from 3 to 0.3).

Recall that training the network amounts to optimising the weights and biases. The basic method is to define a cost function \( C \), which is a function of all of those parameters, and then use (stochastic) gradient descent, in which the vector of parameters (weights and biases) is incremented by \( -\eta \nabla C \). The value of \( \eta \) should be small enough that the linear approximation to the cost surface is reasonable, but a value that's too small will mean that the network learns too slowly.
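As a toy illustration of the update rule (everything here is made up for the example, not the book's code), here's \( -\eta \nabla C \) applied repeatedly to the one-parameter cost \( C(p) = p^2 \):

```python
import numpy as np

def sgd_step(params, grad_C, eta):
    """One gradient-descent update: params <- params - eta * grad C(params)."""
    return params - eta * grad_C(params)

# C(p) = p^2, so grad C = 2p; the minimum is at p = 0.
params = np.array([5.0])
for _ in range(100):
    params = sgd_step(params, lambda p: 2 * p, eta=0.1)
print(params)  # very close to 0
```

With \( \eta = 0.1 \) each step multiplies the parameter by 0.8, so it converges geometrically; crank \( \eta \) above 1 in this toy and the iterates oscillate and diverge instead, which is the "too large \( \eta \)" failure mode in miniature.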

By observing that \( \eta \) should be smaller on background-1 images than on background-0 images, I've really only pushed my lack of understanding
back one step, since I don't have a good explanation for why this should be the case. Indeed, at first glance it looks like it should be the other way
round – my first guess at the problem was that neurons were saturating and gradients were too low, so surely you'd need a *larger* \( \eta \)
to compensate? Perhaps it's just that summing over ~90% non-zero terms instead of ~10% non-zero terms means that \( \eta \) should be decreased by
a factor of approximately 9. Such an explanation is only superficially satisfying to me (the sum then goes into a sigmoid function), but I can
imagine that there's some sense in which too many parameters are being moved too far during the updates.
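A rough way to see the scale difference I'm gesturing at (all numbers synthetic: 784 inputs, ~10% of them "ink", standard-Gaussian weights):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = 784
x = np.zeros(n_in)
x[:78] = 1.0        # background-0 image: ~10% of pixels are 1
x_inv = 1.0 - x     # inverted image: ~90% of pixels are 1

# Spread of the pre-activation z = w . x over many weight draws.
z = np.array([rng.standard_normal(n_in) @ x for _ in range(2000)])
z_inv = np.array([rng.standard_normal(n_in) @ x_inv for _ in range(2000)])
print(z.std(), z_inv.std())  # the inverted images give roughly 3x the spread
```

With nine times as many non-zero terms in the sum, the standard deviation of the pre-activation grows by a factor of \( \sqrt{9} = 3 \), so the inverted images do push the sigmoids further from zero, even if that doesn't by itself settle the \( \eta \) question.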

Having said *that*, I still don't know why one of the output neurons is so often preferred with too-large-\( \eta \) training on the
background-1 images. (If I get, say, 50% accuracy, then usually 4 digits are correctly classified and everything else gets classified as,
e.g., a 6.) Since I've now solved the problem for practical purposes, perhaps I'll leave it be.

### Weight initialisation

While it wasn't the cause of my inverted-pixel problems, it's plausible that too many large numbers being sent into the first hidden layer's sigmoid functions could cause problems in some applications. The obvious solution (that I should have thought of) is to normalise the initial weights appropriately: instead of using standard Gaussians, divide them by the square root of the number of inputs feeding the neuron, i.e. the number of neurons in the previous layer.
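A sketch of the effect (hypothetical layer sizes, not the book's code), comparing the spread of the pre-activation \( z = w \cdot x \) under the two initialisation schemes, in the worst case where all 784 inputs are 1:

```python
import numpy as np

rng = np.random.default_rng(1)
n_in = 784
x = np.ones(n_in)  # worst case: every input pixel is 1

# Standard-Gaussian weights vs weights scaled by 1/sqrt(n_in).
z_std = np.array([rng.standard_normal(n_in) @ x for _ in range(1000)])
z_scaled = np.array([(rng.standard_normal(n_in) / np.sqrt(n_in)) @ x
                     for _ in range(1000)])

print(z_std.std())     # roughly sqrt(784) = 28: deep in the sigmoid's flat tails
print(z_scaled.std())  # roughly 1: the sigmoid stays responsive
```

Dividing by \( \sqrt{n_{\text{in}}} \) keeps the variance of the weighted sum at order 1 regardless of how many inputs there are, which is exactly what stops the hidden neurons starting out saturated.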

### Cross-entropy

The problem of slow learning from saturated neurons in the *output* layer has a neat solution: choose the cost function so that
it cancels out the \( \sigma' \) term that causes the trouble. If \( \bm{y}^L \) is the vector of desired outputs and \( \bm{a}^L \) the actual
outputs, then the quadratic cost function used in chapter 1 is \( C = \frac{1}{2}||\bm{a}^L - \bm{y}^L||^2 \) (up to a normalisation). The partial
derivative of \( C \) with respect to a bias \( b^L_j \) in the output layer is then