Home > Misc > Neural networks \( \newcommand{\bm}{\mathbf} \)

Neural Networks and Deep Learning, Chapter 3

No introduction, just straight into it

In last week's results section, I started by talking about how I'd spent most of my time tracing a bug (if one can call it that – my code was fine!) in which the neural network was able to classify handwritten digits from the MNIST dataset with about 95% accuracy if the ink pixels had value near 1 and the background near 0, but got stuck in various apparent local minima with no more than about 60% accuracy if I inverted the pixels so that the background pixels had value 1 and the ink pixels were near 0. When I wrote it up I said that I had no idea why the network should struggle in this way.

When prodded on Facebook about it, I thought of a possible cause for this unusual behaviour:

Maybe (handwaving here that I haven't checked rigorously) if you have a lot of 1's in the input layer, then the inputs to the next layer _before_ sending it through the sigmoid function will tend to be a long way from zero.

Now, the sigmoid function's derivative is greatest when the input is zero, and the derivative is small when a long way away from zero, and the cost gradient has one or more sigma-prime terms, so perhaps there's just no gradient to descend down out there.

It was somewhat gratifying to read in chapter 3 this essential argument, with some proposed solutions which I'll sketch shortly. (The problem, to adopt the jargon, is of neurons "saturating".) Curiously, despite seeming to be directly applicable to my inverted-pixel problem, they didn't seem to have much of a useful effect, and instead I stumbled across the solution totally by accident – the digits started to be classified at about the same rate as expected (94-95% on my most recent attempt) when I reduced the learning rate \( \eta \) by a factor of 10 (specifically, from 3 to 0.3).

Recall that the goal of the network is to optimise the weights and biases, and the basic method to achieve this is to define a cost function \( C \), which is a function of all of those parameters, and then use (stochastic) gradient descent, where the vector of parameters (weights and biases) is incremented by \( -\eta \nabla C \). The value of \( \eta \) should be small enough so that the linear approximation to the slope is reasonable, but a value that's too small will mean that the network learns too slowly.

By observing that \( \eta \) should be smaller on background-1 images than on background-0 images, I've really only pushed my lack of understanding back one step, since I don't have a good explanation for why this should be the case. Indeed, at first glance it looks like it should be the other way round – my first guess at the problem was that neurons were saturating and gradients were too low, so surely you'd need a larger \( \eta \) to compensate? Perhaps it's just that summing over ~90% non-zero terms instead of ~10% non-zero terms means that \( \eta \) should be decreased by a factor of approximately 9. Such an explanation is only superficially satisfying to me (the sum then goes into a sigmoid function), but I can imagine that there's some sense in which too many parameters are being moved too far during the updates.

Having said that, I still don't know why one of the output neurons is so often preferred with too-large-\( \eta \) training on the background-1 images. (If I get, say, 50% accuracy, usually 4 digits are usually correctly classified, and everything else gets classified as, e.g., a 6.) Since I've now solved the problem for practical purposes, perhaps I'll leave it be.

Weight initialisation

While it wasn't the cause of my inverted-pixel problems, it's plausible that too many large numbers being sent into the first hidden layer's sigmoid functions could cause problems in some applications. The obvious solution (that I should have thought of) is to appropriately normalise the initial weights. Instead of using standard Gaussians, divide them by the square root of the number of neurons in the layer.


Avoiding the problem of slow learning from saturated neurons in the output layer has a neat solution: choose the cost function so that it cancels out the \( \sigma'\) term that causes the trouble. If \( \bm{y}^L \) is the vector of desired outputs and \( \bm{a}^L \) the actual outputs, then the quadratic cost function used in chapter 1 is \( C = \frac{1}{2}||\bm{a}^L - \bm{y}^L||^2 \) (up to a normalisation). The partial derivative of \( C \) with respect to a bias \( b^L_j \) in the output layer is then

\begin{align*} \frac{\partial C}{\partial b^L_j} &= \sum_k \frac{\partial C}{\partial a^L_k}\frac{\partial a^L_k}{\partial b^L_j} \\ &= (a^L_j - y^L_j)\sigma'^L_j. \end{align*}

It would be nice if instead \( \partial C / \partial b^L_j = a^L_j - y^L_j \), so that the pesky \( \sigma' \) doesn't prevent learning when a neuron is saturated in the wrong way (i.e., if the training output is a 1, but the network says it's very likely 0, then the \( \sigma' \) will mean that this component of \( \nabla C \) is very close to zero, and this part of the network won't update much, despite being horribly wrong). Assuming that the j-th entry of such a cost function only contains terms involving the j-th entries of \( \bm{a}^L \) and \( \bm{y}^L \), the function would need to satisfy

\begin{equation*} \frac{\partial C}{\partial b^L_j} = \frac{\partial C}{\partial a^L_j}\sigma'^L_j. \end{equation*}
Now, the sigmoid function is \( \sigma(z) = 1/(1 + \exp(-z)) \), so its derivative \( \sigma' \) is equal to \( \sigma(z)(1 - \sigma(z)) \). At this point in the derivation, I failed to notice that \( \sigma(z^L_j) = a^L_j \), which is a bit #RealImpostorNotASyndrome, and which meant that I had to read ahead before seeing that therefore
\begin{equation*} a^L_j - y^L_j = \frac{\partial C}{\partial a^L_j} a^L_j (1 - a^L_j), \end{equation*}

which is about as easy a differential equation as they come, since all that's required is dividing both sides by \( a^L_j (1 - a^L_j) \) and integrating. The resulting first term on the LHS is \( 1 / (1 - a^L_j) \) and gives a \( -\ln(1 - a^L_j) \) term; the second term is \( y^L_j / a^L_j (1 - a^L_j) \) and requires partial fractions before also giving (a pair of) log terms. Because it was a partial derivative, the "constant" of integration is actually a function of all the other \( a^L_j \) variables; of course each output neuron value satisfies the same differential equation, and the full cost function sums over all of them. The cost function is the cross-entropy:

\begin{equation*} C = -\sum_j [y^L_j \ln(a^L_j) + (1 - y^L_j)\ln(1 - a^L_j)]. \end{equation*}

Wikipedia says that the cross-entropy is just the negative sum of \( p \ln(q) \), and I don't know if there's some abuse of nomenclature going on with the \( (1 - a) \) term above. Regardless, it's pretty funny to finally run into this term a decade after going through Honours with several Dirk Kroese students who never seemed to shut up about the CE method.

A nice feature of first coding the quadratic cost function and then learning the cross-entropy cost function is that the code for the latter only requires editing the former in a very small way: deleting a single \( \sigma' \) term in the calculation of \( \nabla C \) terms for the output layer. There's no reason to prefer the quadratic cost, so from now on I use the cross-entropy.

Regularisation and over-fitting

I remarked at the end of the previous post that part of me wasn't wowed by the performance of the neural network on recognising handwritten digits – almost 24,000 free parameters ought to be enough to put 28 × 28 matrices into one of ten categories. In chapter 3, Nielsen remarks on the opposite objection: with so many free parameters for so few pixels, how is it that neural networks can have such high success rates, rather than over-fitting to the training data? (The analogy he gives is of fitting a much-too-high order polynomial to a scatterplot.) Perhaps with 50,000 training images, this argument doesn't feel so strong (although one can imagine a 24,000-degree polynomial being totally inappropriate as a regression model, even if there are 50,000 data points).

I was not fully convinced by this argument. There are 784 pixels in each image, and it seems to me that simply reading the pixels will soak up a lot of those free parameters in the system – maybe the relevant comparison is 50,000 training images for 24,000 / 784 ≈ 30 "pseudo-free" parameters. That doesn't look so bad!

But even if my previous paragraph is correct (and I am likely missing some concept), neural networks really are astonishingly good at not over-fitting. Suppose that instead of using 30 neurons in the hidden layer, I use 100. There are now 784*100 + 100 + 100*10 + 10 = 79,510 free parameters in the system. Next, I will restrict the training data to just the first 100 images (not a typo!): a total of just 78,400 pixels for the network to train on, less than the number of free parameters.

With \( \eta = 0.5 \), a batch size of 10 for the stochastic gradient calculations, and using none of the regularisation techniques to be described below, the network after 30 epochs of training successfully classified 6893 out of the 10,000 test images. Almost 70% accuracy with just 100 digits to train from! Here's a jittered scatterplot of the classifications (vertical axis) against true digits (horizontal axis):

MORE! What if the same network has only 50 images to train on? There are now more than twice as many free parameters as there are training pixels, but after 30 epochs (same parameters as before), I get 56% accuracy on the test set.

MORE!!!!! I gave the network 10 images to train on: the first of each digit. Working with 7840 pixels, the network's 79,510 free parameters evolved after 67 epochs to give 51% accuracy on the test set of 10,000 images. Jittered scatterplot:

This is awe-inspiringly good, and I can't help but compare with the speed at which humans are able to generalise from limited "training images". Nielsen cites LeCun et al. (PDF; I haven't read the paper) who conjecture that the gradient descent algorithm has an automatic "self-regularisation effect", but that it is unknown how this happens.

Nevertheless, when trying to squeeze out every last percentage point of accuracy from fully-sized training sets, it would be undesirable to have the network over-fit so that instead of being 95% right, it is only 94% right. (And my impression is that there are problems where over-fitting can be catastrophically bad, rather than mildly annoying as with the MNIST digits.) If the network is over-fitting, then its accuracy and cost function will look better on the training data than on the test data, and the risk is that the performance on the test data gets worse as training continues. This is indeed what happens with the digit classification. The following graph shows the accuracy on the training set (red) trend steadily upwards over 100 epochs of training, while the accuracy on the test set (blue) plateaus for a while at a level noticeably below that of the training data, before gradually heading downwards after around 50 epochs. (Parameters are lost to the mists of time, but probably \( \eta = 0.5 \) with a batch size of 10, and cross-entropy cost.)

Curiously, the cost function shows evidence of over-fitting much earlier than the classification accuracy – in the following graph, the cost function on the training data trends downwards over the 100 epochs, but starts trending upwards after only 20 or 30 epochs.

I don't know what to make of this difference – how is the cost function on the test set getting so much worse, while the accuracy remains steady for a decent number of epochs? Nielsen offers little guidance on this point (talking about the same pattern of results with different numbers):

It poses a puzzle, though, which is whether we should regard epoch 15 or epoch 280 as the point at which overfitting is coming to dominate learning? From a practical point of view, what we really care about is improving classification accuracy on the test data, while the cost on the test data is no more than a proxy for classification accuracy. And so it makes most sense to regard epoch 280 as the point beyond which overfitting is dominating learning in our neural network.

This is unsatisfying, since it's the cost function that drives the training, and it's just through some weird stroke of luck that somehow the cost function partially decouples with what it's supposed to achieve in such a way that the results remain good for a while.

Whatever the resolution to that puzzle, it would be good to avoid over-fitting. Reasoning by analogy to the case of fitting too-high-degree polynomials, which would usually have some very large coefficients, Nielsen introduces L2 regularisation*. (Googling suggests that the 2 isn't usually superscripted in this context? I'm not going to abandon habits picked up in MATH4401 so quickly.) The idea is to add to the cost function the sum of squares of all the weights, multiplied by some positive parameter. This sum will tend to drive the weights towards zero, and the network should then evolve to some balance between small weights and fitting to the training data.

*Actually he introduced it first and then reasoned about it later; whether by accident or by good expository design, the reasoning now comes first in my head.

A motivation, which Nielsen discusses at some length, is that a network with smaller weights is somehow simpler than one with large weights, and might be preferred on Occam's-Razor-type grounds. That L2 regularisation is worthwhile though is demonstrated empirically. The adjustment to the code is minor (it involves taking the derivative of \( W_{jk}^2 \) with respect to \( W_{jk} \)), and with a regularisation parameter of \( \lambda = 5 \) – I should mention for completeness that the sum is normalised by the number of training images \( n \) as \( \frac{\lambda}{2n}\sum W^2 \) – the difference between the accuracy and cost on the training and test data is greatly reduced, and the bad trends seen earlier are replaced by harmless plateaus:

Nielsen mentions a couple of other regularisation techniques, namely L1 and dropout. Dropout in broad brush strokes is conceptually interesting (you randomly delete half the neurons while training, to try to make the network more robust), and perhaps one day I'll sit down and code it, but for the moment it looks too intimidating and I'll happily stick to L2 regularisation.

Increasing the training sample size

Another possibility to reduce overfitting concerns that Nielsen describes is to increase the set of training images. At first glance, this seems like an insultingly obvious thing to suggest, but at least in the MNIST case, there's an idea to artificially increase the number of training images that is as fiendishly clever and obvious in hindsight as it was disappointing when I tried to implement it. (To be clear, the latter is my fault, and people have used the technique described to improve classification accuracy.)

The idea is to transform the training images in small ways – a displacement of a couple of pixels, a rotation, etc. A handwritten '5' displaced two pixels to the right will be just as valid and useful a '5' to train on as the original, but the pixel-by-pixel values will be quite different, and therefore (the hope is) the network can be trained to better recognise the underlying pattern of a '5', rather than the specific pixels of the original image.

Nielsen cites a paper (Simard et al.; paywalled; not read) which described an improvement in classification accuracy from 98.4% to 99.3% by including various transformations including some modelled on the sorts of variations introduced by humans when writing. I, on the other hand, saw my results go backwards from 98% to 94%. Maybe I was transforming too many digits to be partially outside the 28 × 28 square? I don't know. At some point during testing, I did at least manage to get the classification accuracy using only the tiny set of 10 training images up from 51% to about 58% by applying some random transformations.

I'm sure this is a useful idea and I'll keep it in my mind for future problems, but for the moment I'm not bothered enough to work out what I've done wrong.


Nielsen talks at length about softmax layers – an alternative to sigmoid-with-cross-entropy for avoiding the learning slow down for saturated neurons in the output layer. The maths is a mildly interesting exercise in calculus; I haven't coded it up but I recall him saying that we'll return to the subject in a later chapter.

There's also a grab-bag of practical tips for getting started on new problems – with many parameters to fiddle with, it's apparently easy to find yourself totally stuck with terrible results at first. In this discussion, he described the standard way of using the validation data instead of the test data as a place for tuning the parameters, which is a distinction that I have yet to abide by.


The core of the code is very similar to the previous instalment's code. In addition to the features described above, there's also an option in the Rcpp function to "taper" the learning rate, which is a very rough-and-ready way to handle something that would be better handled with some automated system testing for convergence of the cost function or classification accuracy. My not-very-good code to randomly transform the input images is also included, but I haven't tidied it up and there are relics of a previous conception of how it would work.

Show code.


The results from chapter 1 had about 95% accuracy. Rather than show heaps of correctly classified digits, I extracted the first 200 that the old code had trouble with, defined as the true answer being called with less than 70% "probability". (Output values of the network may not sum to 1; I normalise them, but I'm not sure if such values deserve to be called probabilities without scare quotes.)

Classifications for this tough set of 200 digits are shown below for six different setups. The first two use only the basic setup from chapter 1, and didn't undergo much careful tuning of the learning rate beyond being in the range of what Nielsen uses at various times. I think the first one had \( \eta = 0.3 \) for 15 or 20 epochs and then \( \eta = 0.1 \) for another 15 or 20; the second had 10 epochs each at \( \eta = 0.3,\, 0.1,\, 0.05 \). The four tests with the cross-entropy cost all followed the same training regime of 10 epochs at each of \( \eta = 0.5,\, 0.1,\, 0.02 \) in turn.

Classification accuracies aren't rperfectly eproducible since the weights and biases are initialised with random numbers, and I wouldn't trust the decimal place too much. I also tried a hidden layer of 200 neurons, but while it nudged 98.2% in training, it ended at 98.1% like the 100-neuron layer, and I don't include it below.

Sometimes a digit correctly classified by a worse-overall network becomes wrongly classified by the better-overall network.


It was quite satisfying to see the tests break through the psychological threshold of 98%. My recollection from the introduction is that there's only one more percentage point of accuracy to be squeezed out over the rest of the book, which I think will happen with so-called "deep" networks, i.e., networks with more than one hidden layer. (The code I've written is functional on multi-layer networks already, but the results are not an improvement on what's seen above; maybe a smidgen worse, 97.5% instead of 98.1%, but I haven't worked on it carefully.)

The case of the ten training images on a network with many more free parameters than training pixels fascinates me, and I hope to spend some time poking and prodding at the behaviour as the training set becomes pathological.

Chapter 4 is titled "A visual proof that neural nets can compute any function", which doesn't look like it'll lend itself to a write-up. The next post will probably be on a smaller topic – perhaps playing with tiny training sets, or non-neural methods for digit classification (I got 70% doing some slightly weird PCA-based thing). There was a gap of several weeks between the first post in this series and the second, but my day job is starting to ease up, and subsequent posts shouldn't take so long to learn and write.

Posted 2015-11-07, updated 2015-11-14 (minor editing).

Home > Misc > Neural networks