In last week's results section, I started by talking about how I'd spent most of my time tracing a bug (if one
can call it that – my code was fine!) in which the neural network was able to classify handwritten digits from the MNIST dataset with about 95%
accuracy if the ink pixels had value near 1 and the background near 0, but got stuck in various apparent local minima with no more than about 60%
accuracy if I inverted the pixels so that the background pixels had value 1 and the ink pixels were near 0. When I wrote it up I said that I had no
idea why the network should struggle in this way.
When prodded on Facebook about it, I thought of a possible cause for this unusual behaviour:
Maybe (handwaving here, since I haven't checked it rigorously) if you have a lot of 1's in the input layer, then the inputs to the next layer
– _before_ sending them through the sigmoid function – will tend to be a long way from zero.
Now, the sigmoid function's derivative is greatest when its input is zero and small when the input is a long way from zero, and the cost
gradient contains one or more \( \sigma' \) terms, so perhaps there's just no gradient to descend out there.
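As a quick sanity check of that handwaving, here's the sigmoid and its derivative at a few values of the weighted input \( z \) (plain R, nothing network-specific):

# The sigmoid's derivative collapses once z is more than a few units from zero,
# so a saturated neuron contributes almost nothing to the gradient.
sigmoid = function(z) 1 / (1 + exp(-z))
z = c(0, 2, 5, 10, 20)
data.frame(z = z, sigma = sigmoid(z), sigma_prime = sigmoid(z) * (1 - sigmoid(z)))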
It was somewhat gratifying to read essentially this argument in chapter 3, along with some
proposed solutions which I'll sketch shortly. (The problem, to adopt the jargon, is of neurons "saturating".) Curiously, despite seeming to be directly
applicable to my inverted-pixel problem, they didn't seem to have much of a useful effect, and instead I stumbled across the solution totally by
accident – the digits started to be classified at about the same rate as expected (94-95% on my most recent attempt) when I reduced the learning
rate \( \eta \) by a factor of 10 (specifically, from 3 to 0.3).
Recall that the goal of training the network is to optimise its weights and biases, and the basic method to achieve this is to define a cost function
\( C \), which is a function of all of those parameters, and then use (stochastic) gradient descent, where the vector of parameters (weights and biases)
is incremented by \( -\eta \nabla C \). The value of \( \eta \) should be small enough so that the linear approximation to the slope is reasonable,
but a value that's too small will mean that the network learns too slowly.
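As a toy illustration of this trade-off (gradient descent on a one-dimensional quadratic cost, nothing to do with the network itself):

# Gradient descent on C(w) = w^2: a modest eta creeps towards the minimum at
# w = 0, while an eta that is too large overshoots by more and more each step.
descend = function(eta, w = 5, steps = 20) {
  for (i in seq_len(steps)) {
    grad = 2*w           # dC/dw for C = w^2
    w = w - eta*grad     # the increment -eta * grad C
  }
  w
}
descend(eta = 0.1)   # ends up near 0
descend(eta = 1.1)   # diverges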
By observing that \( \eta \) should be smaller on background-1 images than on background-0 images, I've really only pushed my lack of understanding
back one step, since I don't have a good explanation for why this should be the case. Indeed, at first glance it looks like it should be the other way
round – my first guess at the problem was that neurons were saturating and gradients were too low, so surely you'd need a larger \( \eta \)
to compensate? Perhaps it's just that summing over ~90% non-zero terms instead of ~10% non-zero terms means that \( \eta \) should be decreased by
a factor of approximately 9. Such an explanation is only superficially satisfying to me (the sum then goes into a sigmoid function), but I can
imagine that there's some sense in which too many parameters are being moved too far during the updates.
Having said that, I still don't know why one of the output neurons is so often preferred with too-large-\( \eta \) training on the
background-1 images. (If I get, say, 50% accuracy, usually 4 digits are correctly classified, and everything else gets classified as,
e.g., a 6.) Since I've now solved the problem for practical purposes, perhaps I'll leave it be.
Weight initialisation
While it wasn't the cause of my inverted-pixel problems, it's plausible that too many large numbers being sent into the first hidden layer's sigmoid
functions could cause problems in some applications. The obvious solution (that I should have thought of) is to appropriately normalise the initial
weights. Instead of using standard Gaussians, divide each weight by the square root of the number of neurons feeding into it (i.e., the number of neurons in the previous layer).
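A rough check of why the scaling matters, with 784 inputs all equal to 1 (roughly a background-1 image, ignoring the ink pixels):

# With standard-Gaussian weights, the weighted input z into a first-hidden-layer
# neuron has standard deviation sqrt(784) = 28, deep in the saturated region of
# the sigmoid, whereas scaling the weights by 1/sqrt(784) brings the standard
# deviation back to about 1. (The bias adds a little extra spread; ignored here.)
n_in = 784
x = rep(1, n_in)
z_unscaled = replicate(1000, sum(rnorm(n_in) * x))
z_scaled = replicate(1000, sum(rnorm(n_in) / sqrt(n_in) * x))
c(sd(z_unscaled), sd(z_scaled))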
Cross-entropy
Avoiding the problem of slow learning from saturated neurons in the output layer has a neat solution: choose the cost function so that
it cancels out the \( \sigma'\) term that causes the trouble. If \( \bm{y}^L \) is the vector of desired outputs and \( \bm{a}^L \) the actual
outputs, then the quadratic cost function used in chapter 1 is \( C = \frac{1}{2}||\bm{a}^L - \bm{y}^L||^2 \) (up to a normalisation). The partial
derivative of \( C \) with respect to a bias \( b^L_j \) in the output layer is then
\[ \frac{\partial C}{\partial b^L_j} = \left(a^L_j - y^L_j\right) \sigma'(z^L_j), \]
where \( z^L_j \) is the weighted input to the \( j \)-th output neuron.
It would be nice if instead \( \partial C / \partial b^L_j = a^L_j - y^L_j \), so that the pesky \( \sigma' \) doesn't prevent learning when a neuron is
saturated in the wrong way (i.e., if the training output is a 1, but the network says it's very likely 0, then the \( \sigma' \) will mean that
this component of \( \nabla C \) is very close to zero, and this part of the network won't update much, despite being horribly wrong). Assuming that the
cost function is a sum over the output neurons, with the \( j \)-th term only involving the \( j \)-th entries of \( \bm{a}^L \) and \( \bm{y}^L \), the function would need to
satisfy
\[ \frac{\partial C}{\partial a^L_j}\, \sigma'(z^L_j) = a^L_j - y^L_j. \]
Now, the sigmoid function is \( \sigma(z) = 1/(1 + \exp(-z)) \), so its derivative \( \sigma' \) is equal to \( \sigma(z)(1 - \sigma(z)) \). At this
point in the derivation, I failed to notice that \( \sigma(z^L_j) = a^L_j \), which is a bit #RealImpostorNotASyndrome, and which meant that I had
to read ahead before seeing that therefore
\[ a^L_j - y^L_j = a^L_j \left(1 - a^L_j\right) \frac{\partial C}{\partial a^L_j}, \]
which is about as easy a differential equation as they come, since all that's required is dividing both sides by \( a^L_j (1 - a^L_j) \) and
integrating. The resulting first term on the LHS is \( 1 / (1 - a^L_j) \) and gives a \( -\ln(1 - a^L_j) \) term; the second term is
\( y^L_j / \left(a^L_j (1 - a^L_j)\right) \) and requires partial fractions before also giving (a pair of) log terms. Because it was a partial derivative, the
"constant" of integration is actually a function of all the other output activations \( a^L_k \); of course each output neuron's value satisfies the same
differential equation, and the full cost function sums over all of them. The cost function is the cross-entropy:
\[ C = -\sum_j \left[ y^L_j \ln a^L_j + \left(1 - y^L_j\right) \ln\left(1 - a^L_j\right) \right]. \]
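Written out (my own working, just formalising the partial-fractions step described above):
\[ \frac{\partial C}{\partial a^L_j} = \frac{a^L_j - y^L_j}{a^L_j (1 - a^L_j)} = \frac{1}{1 - a^L_j} - y^L_j \left( \frac{1}{a^L_j} + \frac{1}{1 - a^L_j} \right), \]
and integrating with respect to \( a^L_j \) gives
\[ -\ln(1 - a^L_j) - y^L_j \ln a^L_j + y^L_j \ln(1 - a^L_j) = -\left[ y^L_j \ln a^L_j + (1 - y^L_j) \ln(1 - a^L_j) \right], \]
up to that "constant" of integration.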
Wikipedia says that the cross-entropy is just the negative sum of \( p \ln(q) \), and I
don't know if there's some abuse of nomenclature going on with the \( (1 - a) \) term above. Regardless, it's pretty funny to finally run into this term
a decade after going through Honours with several Dirk Kroese students who never seemed to shut up about the
CE method.
A nice feature of first coding the quadratic cost function and then learning the cross-entropy cost function is that the code for the latter only
requires editing the former in a very small way: deleting a single \( \sigma' \) term in the calculation of \( \nabla C \) terms for the output
layer. There's no reason to prefer the quadratic cost, so from now on I use the cross-entropy.
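As a sketch of what that edit amounts to (plain R rather than the actual Rcpp in the Code section below; a and y here are stand-ins for the output activations and the one-hot target of a single training example):

a = c(0.02, 0.01, 0.90, 0.03, 0.01, 0.01, 0.00, 0.01, 0.00, 0.01)   # network output
y = c(0, 0, 0, 1, 0, 0, 0, 0, 0, 0)                                 # true digit is a 3
delta_quadratic = (a - y) * a * (1 - a)   # quadratic cost keeps the sigma' factor a*(1-a)
delta_cross_entropy = (a - y)             # cross-entropy cost: the sigma' factor has cancelled
round(cbind(delta_quadratic, delta_cross_entropy), 4)

For the badly-wrong output neuron (the '3', with activation 0.03), the quadratic-cost term is throttled to about −0.03 by the \( \sigma' \) factor, while the cross-entropy term is a healthy −0.97.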
Regularisation and over-fitting
I remarked at the end of the previous post that part of me wasn't wowed by the performance of the neural network on
recognising handwritten digits – almost 24,000 free parameters ought to be enough to put 28 × 28 matrices into one of ten categories.
In chapter 3, Nielsen remarks on the opposite objection: with so many free parameters for so few pixels, how is it that neural networks can have
such high success rates, rather than over-fitting to the training data? (The analogy he gives is of fitting a much-too-high order polynomial to a
scatterplot.) Perhaps with 50,000 training images, this argument doesn't feel so strong (although one can imagine a 24,000-degree polynomial being
totally inappropriate as a regression model, even if there are 50,000 data points).
I was not fully convinced by this argument. There are 784 pixels in each image, and it seems to me that simply reading the pixels will soak up
a lot of those free parameters in the system – maybe the relevant comparison is 50,000 training images for 24,000 / 784 ≈ 30 "pseudo-free"
parameters. That doesn't look so bad!
But even if my previous paragraph is correct (and I am likely missing some concept), neural networks really are astonishingly good at not
over-fitting. Suppose that instead of using 30 neurons in the hidden layer, I use 100. There are now 784 × 100 + 100 + 100 × 10 + 10 = 79,510 free
parameters in the system. Next, I will restrict the training data to just the first 100 images (not a typo!): a total of just 78,400 pixels
for the network to train on, less than the number of free parameters.
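For concreteness, a run like the one described in the next paragraph can be set up with new_network() and train_network() from the Code section below; I no longer have the exact script, so treat the call as a sketch (the parameter values are as described in the text):

layer_sizes = c(784, 100, 10)
# Free parameters = weights plus biases:
sum(layer_sizes[-length(layer_sizes)] * layer_sizes[-1]) + sum(layer_sizes[-1])   # 79510

start_network = new_network(layer_sizes)
out_network = train_network(images$training[1:100], vec_digits$training[1:100],
                            start_network, images$test, vec_digits$test,
                            lambda=0, eta=0.5, batch_size=10, num_epochs=30, expand_set=FALSE,
                            eta_tapers=0, taper=1)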
With \( \eta = 0.5 \), a batch size of 10 for the stochastic gradient calculations, and using none of the regularisation techniques to be described
below, the network after 30 epochs of training successfully classified 6893 out of the 10,000 test images. Almost 70% accuracy with just 100 digits
to train from! Here's a jittered scatterplot of the classifications (vertical axis) against true digits (horizontal axis):
MORE! What if the same network has only 50 images to train on? There are now more than twice as many free parameters as there are training
pixels, but after 30 epochs (same parameters as before), I get 56% accuracy on the test set.
MORE!!!!! I gave the network 10 images to train on: the first of each digit. Working with
7840 pixels, the network's 79,510 free parameters evolved after 67 epochs to give 51% accuracy on the test set of 10,000 images. Jittered scatterplot:
This is awe-inspiringly good, and I can't help but compare with the speed at which humans are able to generalise from limited "training images".
Nielsen cites LeCun et al. (PDF; I haven't read the paper) who conjecture that the
gradient descent algorithm has an automatic "self-regularisation effect", but that it is unknown how this happens.
Nevertheless, when trying to squeeze out every last percentage point of accuracy from fully-sized training sets, it would be undesirable to have
the network over-fit so that instead of being 95% right, it is only 94% right. (And my impression is that there are problems where over-fitting can
be catastrophically bad, rather than mildly annoying as with the MNIST digits.) If the network is over-fitting, then its accuracy and cost function
will look better on the training data than on the test data, and the risk is that the performance on the test data gets worse as training continues.
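(For what it's worth, the graphs below come from the per-epoch diagnostics that train_network() in the Code section prints; if those lines are saved to a file – "training_log.csv" is a made-up name, and I'm assuming the non-CSV lines like "Starting training." have been stripped – something like the following reproduces the curves. The column names are mine, matching the order of the Rprintf() call.)

library(ggplot2)
log_df = read.csv("training_log.csv", header=FALSE,
                  col.names=c("epoch", "test_correct", "n_test", "test_cost",
                              "train_correct", "n_train", "train_cost", "cost_gap"))
log_df$test_accuracy = log_df$test_correct / log_df$n_test
log_df$train_accuracy = log_df$train_correct / log_df$n_train
ggplot(log_df) +
  geom_line(aes(epoch, train_accuracy), colour="red") +
  geom_line(aes(epoch, test_accuracy), colour="blue")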
This is indeed what happens with the digit classification. The following graph shows the accuracy on the training set (red) trend steadily upwards over
100 epochs of training, while the accuracy on the test set (blue) plateaus for a while at a level noticeably below that of the training data, before
gradually heading downwards after around 50 epochs. (Parameters are lost to the mists of time, but probably \( \eta = 0.5 \) with a batch size of 10,
and cross-entropy cost.)
Curiously, the cost function shows evidence of over-fitting much earlier than the classification accuracy – in the following graph, the cost
function on the training data trends downwards over the 100 epochs, while the cost on the test data starts trending upwards after only 20 or 30 epochs.
I don't know what to make of this difference – how is the cost function on the test set getting so much worse, while the accuracy remains
steady for a decent number of epochs? Nielsen offers little guidance on this point (talking about the same pattern of results with different numbers):
It poses a puzzle, though, which is whether we should regard epoch 15 or epoch 280 as the point at which overfitting is coming to dominate
learning? From a practical point of view, what we really care about is improving classification accuracy on the test data, while the cost on the test
data is no more than a proxy for classification accuracy. And so it makes most sense to regard epoch 280 as the point beyond which overfitting is
dominating learning in our neural network.
This is unsatisfying, since it's the cost function that drives the training, and it seems to be just some weird stroke of luck that the
cost function partially decouples from what it's supposed to achieve in such a way that the classification results remain good for a while.
Whatever the resolution to that puzzle, it would be good to avoid over-fitting. Reasoning by analogy to the case of fitting too-high-degree
polynomials, which would usually have some very large coefficients, Nielsen introduces L2 regularisation*. (Googling suggests that the 2 isn't
usually superscripted in this context? I'm not going to abandon habits picked up in MATH4401 so quickly.) The idea is to add to the cost function
the sum of squares of all the weights, multiplied by some positive parameter. This sum will tend to drive the weights towards zero, and the network
should then evolve to some balance between small weights and fitting to the training data.
*Actually he introduced it first and then reasoned about it later; whether by accident or by good expository design, the reasoning now
comes first in my head.
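Concretely, if \( C_0 \) is the unregularised cost, the new cost is
\[ C = C_0 + \frac{\lambda}{2n} \sum_{j,k} W_{jk}^2, \]
and the gradient-descent step for each weight picks up a shrinkage term (this is what the eta*lambda*W(j, k) term in the Rcpp code below implements):
\[ W_{jk} \to \left(1 - \frac{\eta\lambda}{n}\right) W_{jk} - \eta \frac{\partial C_0}{\partial W_{jk}}. \]
The biases are left out of the sum and update as before.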
A motivation, which Nielsen discusses at some length, is that a network with smaller weights is somehow simpler than one with large weights, and might
be preferred on Occam's-Razor-type grounds. That L2 regularisation is worthwhile, though, is demonstrated empirically. The adjustment to
the code is minor (it involves taking the derivative of \( W_{jk}^2 \) with respect to \( W_{jk} \)), and with a regularisation parameter of
\( \lambda = 5 \) – I should mention for completeness that the sum is normalised by the number of training images \( n \) as
\( \frac{\lambda}{2n}\sum W^2 \) – the difference between the accuracy and cost on the training and test data is greatly reduced, and the bad
trends seen earlier are replaced by harmless plateaus:
Nielsen mentions a couple of other regularisation techniques, namely L1 and dropout. Dropout in broad brush strokes is conceptually
interesting (you randomly delete half the neurons while training, to try to make the network more robust), and perhaps one day I'll sit down and code it,
but for the moment it looks too intimidating and I'll happily stick to L2 regularisation.
Increasing the training sample size
Another possibility that Nielsen describes for reducing over-fitting is to increase the set of training images. At first glance, this seems
like an insultingly obvious thing to suggest, but at least in the MNIST case, there's an idea to artificially increase the number of training images
that is as fiendishly clever and obvious in hindsight as it was disappointing when I tried to implement it. (To be clear, the latter is my fault, and
people have used the technique described to improve classification accuracy.)
The idea is to transform the training images in small ways – a displacement of a couple of pixels, a rotation, etc. A handwritten '5'
displaced two pixels to the right will be just as valid and useful a '5' to train on as the original, but the pixel-by-pixel values will be quite
different, and therefore (the hope is) the network can be trained to better recognise the underlying pattern of a '5', rather than the specific pixels
of the original image.
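A toy sketch of the displacement part (independent of the modify_image() function in the Code section below, and assuming row-major pixel order): shift a 28 × 28 image two pixels to the right, padding the vacated columns with the background value of 0.

shift_right = function(img_vec, shift=2, n_row=28, n_col=28, background=0) {
  img = matrix(img_vec, nrow=n_row, ncol=n_col, byrow=TRUE)
  shifted = matrix(background, nrow=n_row, ncol=n_col)
  shifted[, (shift+1):n_col] = img[, 1:(n_col-shift)]
  as.vector(t(shifted))
}
# e.g. shifted_image = shift_right(images$training[[1]])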
Nielsen cites a paper (Simard et al.; paywalled; not read) which described an
improvement in classification accuracy from 98.4% to 99.3% by including various transformations including some modelled on the sorts of variations
introduced by humans when writing. I, on the other hand, saw my results go backwards from 98% to 94%. Maybe I was transforming too many digits to
be partially outside the 28 × 28 square? I don't know. At some point during testing, I did at least manage to get the classification
accuracy using only the tiny set of 10 training images up from 51% to about 58% by applying some random transformations.
I'm sure this is a useful idea and I'll keep it in my mind for future problems, but for the moment I'm not bothered enough to work out what I've
done wrong.
Miscellaneous
Nielsen talks at length about softmax layers – an alternative to sigmoid-with-cross-entropy for avoiding the learning slow down for
saturated neurons in the output layer. The maths is a mildly interesting exercise in calculus; I haven't coded it up but I recall him
saying that we'll return to the subject in a later chapter.
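For reference only (this is not part of my code, since I haven't implemented it), the softmax function itself is simple: each output is the exponential of its weighted input, normalised so that the outputs sum to 1.

softmax = function(z) {
  exp_z = exp(z - max(z))   # subtracting max(z) doesn't change the result but avoids overflow
  exp_z / sum(exp_z)
}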
There's also a grab-bag of practical tips for getting started on new problems – with many parameters to fiddle with, it's apparently
easy to find yourself totally stuck with terrible results at first. In this discussion, he described the standard way of using the validation data
instead of the test data as a place for tuning the parameters, which is a distinction that I have yet to abide by.
Code
The core of the code is very similar to the previous instalment's code. In addition to the features described above,
there's also an option in the Rcpp function to "taper" the learning rate, which is a very rough-and-ready way to handle something that would be
better handled with some automated system testing for convergence of the cost function or classification accuracy. My not-very-good code to
randomly transform the input images is also included, but I haven't tidied it up and there are relics of a previous conception of how it would work.
R code:
library(Rcpp)
library(ggplot2)
sourceCpp("train_test_network.cpp")
load("digit_data/r_images_nielsen.rda")
new_network = function(n) {
  # n is vector containing the number of neurons in each layer
  num_layers = length(n)
  if (num_layers < 2) {
    stop("Need longer input to new_network()")
  }
  network = list()
  length(network) = num_layers
  for (i in seq_along(n)) {
    this_n = n[i]
    if (i < num_layers) {
      next_n = n[i+1]
    }
    k = 1
    network[[i]] = list()
    network[[i]]$a = 0*rnorm(this_n)
    network[[i]]$bias = k*rnorm(this_n)
    if (i == num_layers) {
      network[[i]]$bias = k*rnorm(this_n)
    }
    if (i < num_layers) {
      next_n = n[i+1]
      network[[i]]$weight = matrix(k*rnorm(this_n*next_n) / sqrt(n[i]*100), nrow=next_n, ncol=this_n)
    }
  }
  return(network)
}
start_network = new_network(c(784, 100, 10))
# If I want to use only one image of each digit as the training set:
restricted_training = c(2, 4, 6, 8, 3, 1, 14, 16, 18, 5)
out_network = train_network(images$training, vec_digits$training, start_network, images$test, vec_digits$test,
lambda=5, eta=0.02, batch_size=10, num_epochs=5, expand_set=FALSE,
eta_tapers=0, taper=5.0)
test_answers = test_network(images$test, out_network)
test_answers_vec = sapply(test_answers, function(x) { which(x == max(x))[1] - 1 })
true_answers = digits$test
qplot(true_answers, test_answers_vec, alpha=.1, position=position_jitter()) + guides(alpha=FALSE)
Rcpp code:
#include <Rcpp.h>
using namespace Rcpp;
double sigmoid(double z) {
  return 1.0 / (1.0 + exp(-z));
}

// Deep copy of a NumericVector (so that later edits don't touch the original).
NumericVector clone_vector(NumericVector input) {
  long N = input.size();
  NumericVector v(N);
  for (long i = 0; i < N; i++) {
    v[i] = input[i];
  }
  return v;
}

// Deep copy of a NumericMatrix.
NumericMatrix clone_matrix(NumericMatrix input) {
  long N1 = input.nrow();
  long N2 = input.ncol();
  NumericMatrix M(N1, N2);
  for (long i = 0; i < N1; i++) {
    for (long j = 0; j < N2; j++) {
      M(i, j) = input(i, j);
    }
  }
  return M;
}

// http://gallery.rcpp.org/articles/stl-random-shuffle/
// wrapper around R's RNG such that we get a uniform distribution over
// [0,n) as required by the STL algorithm
inline int rand_wrapper(const int n) { return floor(unif_rand()*n); }

NumericVector shuffle(NumericVector a) {
  // clone a into b to leave a alone
  NumericVector b = clone(a);
  std::random_shuffle(b.begin(), b.end(), rand_wrapper);
  return b;
}
// Forward pass: copies the input image into the first layer and then computes
// the sigmoid activations of each subsequent layer, storing them in the
// network's "a" vectors. Returns the updated network.
List feed_forward(NumericVector input, List network) {
  int num_layers = network.size();
  List this_list, prev_list;
  NumericVector a, prev_a, b;
  long i, j, k;
  double sum_z;
  for (i = 0; i < num_layers; i++) {
    this_list = network[i];
    a = this_list["a"];
    if (i == 0) {
      for (j = 0; j < input.size(); j++) {
        a[j] = input[j];
      }
    } else {
      prev_list = network[i - 1];
      prev_a = prev_list["a"];
      b = this_list["bias"];
      // The weight matrix into layer i is stored with layer i-1.
      NumericMatrix W = prev_list["weight"];
      for (j = 0; j < a.size(); j++) {
        sum_z = 0;
        for (k = 0; k < prev_a.size(); k++) {
          sum_z += W(j, k) * prev_a[k];
        }
        sum_z += b[j];
        a[j] = sigmoid(sum_z);
      }
    }
  }
  return network;
}
// Index of the (first) maximum entry of v; -1 if something goes wrong.
long max_i(NumericVector v) {
  long N = v.size();
  double max_v = max(v);
  for (long i = 0; i < N; i++) {
    if (v[i] == max_v) {
      return i;
    }
  }
  return -1;
}
// Run each test image through the network and return a list of the raw
// output-layer activations.
// [[Rcpp::export]]
List test_network(List test_images, List network) {
  long N = test_images.size();
  long num_layers = network.size();
  List answers(N);
  List temp_list;
  temp_list = network[num_layers - 1];
  NumericVector a_out = clone_vector(temp_list["a"]);
  long num_out = a_out.size();
  for (long i = 0; i < N; i++) {
    NumericVector this_test = test_images[i];
    network = feed_forward(this_test, network);
    answers[i] = NumericVector(num_out);
    NumericVector v = answers[i];
    temp_list = network[num_layers - 1];
    NumericVector temp_v = clone_vector(temp_list["a"]);
    for (long j = 0; j < num_out; j++) {
      v[j] = temp_v[j];
    }
  }
  return answers;
}
// Number of correctly classified images and the cost over a data set; the cost
// recorded here is just -log of the network's output for the true digit,
// averaged over the images.
NumericVector test_scores(List test_set, List test_outputs, List network) {
  List test_answers = test_network(test_set, network);
  long num_test_correct = 0;
  long num_test_images = test_set.size();
  long this_true_digit;
  double test_cost = 0;
  for (long i = 0; i < num_test_images; i++) {
    NumericVector this_ans = test_answers[i];
    NumericVector this_true = test_outputs[i];
    this_true_digit = max_i(this_true);
    if (max_i(this_ans) == this_true_digit) {
      num_test_correct++;
    }
    if (fabs(this_ans[this_true_digit] - 1.0) > 1e-20) {
      test_cost += log(this_ans[this_true_digit]);
    }
  }
  test_cost /= -1.0 * num_test_images;
  NumericVector v(2);
  v[0] = num_test_correct;
  v[1] = test_cost;
  return v;
}
// Randomly displace, stretch, and rotate a training image (used when
// expand_set is TRUE in train_network). Relics of an earlier design remain:
// the mod_set argument is unused and rot_type is hard-coded.
// [[Rcpp::export]]
NumericVector modify_image(NumericVector this_training, long mod_set) {
  // Displace image, then stretch, then rotate. Not sophisticated with
  // rounding pixel values / aliasing / whatever.
  long N = this_training.size();
  NumericVector displaced_image(N);
  NumericVector rotated_image(N);
  NumericVector stretched_image(N);
  // Manually set the size of the matrix!!
  long N_rows = 28;
  long N_cols = 28;
  if (N_rows * N_cols != N) {
    Rcpp::stop("Mismatched sizes in modify_image()");
  }
  long i, j, d_i, d_j, idx;
  double theta, i_double, j_double, i_0, j_0;
  long rot_type = 999;
  d_i = round(R::runif(-5, 5));
  d_j = round(R::runif(-5, 5));
  for (idx = 0; idx < N; idx++) {
    // Row and column of this pixel, then displace to find the source pixel.
    j = (idx % N_cols);
    i = (idx - j) / N_rows;
    i += d_i;
    j += d_j;
    if ((i < 0) || (i >= N_rows) || (j < 0) || (j >= N_cols)) {
      // Careful on the background pixel colour!
      displaced_image[idx] = 0.0;
    } else {
      displaced_image[idx] = this_training[i*N_cols + j];
    }
  }
  i_0 = (double)N_rows / 2.0;
  j_0 = (double)N_cols / 2.0;
  double stretch_factor = R::runif(0.5, 1.3);
  bool stretch_x = (R::runif(0.0, 1.0) > 0.5);
  for (idx = 0; idx < N; idx++) {
    j = (idx % N_cols);
    i = ((idx - j) / N_rows);
    if (stretch_x) {
      i_double = i_0 + ((double)i - i_0)*stretch_factor;
      i = round(i_double);
    } else {
      j_double = j_0 + ((double)j - j_0)*stretch_factor;
      j = round(j_double);
    }
    if ((i < 0) || (i >= N_rows) || (j < 0) || (j >= N_cols)) {
      // Careful on the background pixel colour!
      stretched_image[idx] = 0.0;
    } else {
      stretched_image[idx] = displaced_image[i*N_cols + j];
    }
  }
  if (rot_type == 0) {
    return displaced_image;
  } else {
    theta = R::runif(-0.34906585, 0.34906585);   // up to +/- 20 degrees
    for (idx = 0; idx < N; idx++) {
      j = (idx % N_cols);
      i = ((idx - j) / N_rows);
      i_double = i_0 - ((double)j - j_0)*sin(theta) + ((double)i - i_0)*cos(theta);
      j_double = j_0 + ((double)j - j_0)*cos(theta) + ((double)i - i_0)*sin(theta);
      i = round(i_double);
      j = round(j_double);
      if ((i < 0) || (i >= N_rows) || (j < 0) || (j >= N_cols)) {
        // Careful on the background pixel colour!
        rotated_image[idx] = 0.0;
      } else {
        rotated_image[idx] = stretched_image[i*N_cols + j];
      }
    }
    return rotated_image;
  }
  // Shouldn't get to here.
  return this_training;
}
// Train the network by stochastic gradient descent with the cross-entropy cost
// and L2 regularisation, printing one line of comma-separated diagnostics per
// epoch. If eta_tapers > 0, the learning rate is divided by 'taper' after each
// block of the original number of epochs.
// [[Rcpp::export]]
List train_network(List training_set, List training_outputs, List network,
                   List test_set, List test_outputs,
                   double lambda, double eta, long batch_size, long num_epochs, bool expand_set,
                   int eta_tapers, double taper) {
  long i, j, k, train_ct, i_train, max_this_batch, i_epoch, i_layer, base_i_train, train_set_i;
  double this_deriv_part, this_deriv_term, test_cost;
  NumericVector next_a_term, this_a_term, this_test_scores, this_training_scores;
  NumericMatrix aux_W;
  List aux_list;
  int num_layers = network.size();
  long num_training_images = training_set.size();
  long num_base_training_images = num_training_images;
  if (expand_set) {
    num_training_images *= 1;
  }
  long num_test_images = test_set.size();
  long orig_epochs = num_epochs;
  num_epochs += eta_tapers * num_epochs;
  long num_test_correct, this_true_digit;
  long num_neurons[num_layers];
  double eta_batch = eta / batch_size;
  // The regularisation term in the cost is (lambda/2n) * sum W^2, so fold the
  // 1/n into lambda here.
  lambda = lambda / num_training_images;
  List delta_W(num_layers - 1);
  List delta_b(num_layers);
  List helper_delta_b(num_layers);
  List temp_list;
  for(i = 0; i < num_layers; i++) {
    // Rcpp doesn't seem to like combining the following three lines:
    temp_list = network[i];
    NumericVector this_layer = temp_list["a"];
    num_neurons[i] = this_layer.size();
    delta_b[i] = NumericVector(num_neurons[i]);
    if (i > 0) {
      delta_W[i - 1] = NumericMatrix(num_neurons[i], num_neurons[i - 1]);
    }
  }
  NumericVector order_images(num_training_images);
  for (i = 0; i < num_training_images; i++) {
    order_images[i] = i;
  }
  Rprintf("Starting training.\n");
  for (i_epoch = 0; i_epoch < num_epochs; i_epoch++) {
    order_images = shuffle(order_images);
    if (i_epoch > 0) {
      if (i_epoch % orig_epochs == 0) {
        eta /= taper;
        Rprintf("Reducing eta to %f\n", eta);
      }
    }
    train_ct = 0;
    do {
      // Re-initialise the increment lists:
      for (i = 0; i < num_layers; i++) {
        delta_b[i] = NumericVector(num_neurons[i]);
        if (i > 0) {
          delta_W[i - 1] = NumericMatrix(num_neurons[i], num_neurons[i - 1]);
        }
      }
      max_this_batch = train_ct + batch_size;
      if (max_this_batch > num_training_images) {
        max_this_batch = num_training_images;
      }
      for (i_train = train_ct; i_train < max_this_batch; i_train++) {
        NumericVector this_training, this_training_output;
        if (expand_set) {
          // Randomly transform a copy of one of the base training images.
          base_i_train = (long)order_images[i_train] % num_base_training_images;
          train_set_i = ((long)order_images[i_train] - base_i_train) / num_base_training_images;
          this_training = training_set[(long)base_i_train];
          this_training = modify_image(this_training, train_set_i);
          this_training_output = training_outputs[(long)base_i_train];
        } else {
          this_training = training_set[(long)order_images[i_train]];
          this_training_output = training_outputs[(long)order_images[i_train]];
        }
        network = feed_forward(this_training, network);
        for (i = 0; i < num_layers; i++) {
          helper_delta_b[i] = NumericVector(num_neurons[i]);
        }
        List temp_list = network[num_layers - 1];
        NumericVector output_a = clone_vector(temp_list["a"]);
        NumericVector output_d_sig = clone_vector(temp_list["a"]);
        // Start with the output layer.
        NumericVector this_helper = helper_delta_b[num_layers - 1];
        for (j = 0; j < num_neurons[num_layers - 1]; j++) {
          output_d_sig[j] = output_d_sig[j] * (1.0 - output_d_sig[j]);
          // Cost function of ||a - y||^2:
          // this_helper[j] = (output_a[j] - this_training_output[j]) * output_d_sig[j];
          // Cross-entropy cost (the sigma' factor cancels):
          this_helper[j] = output_a[j] - this_training_output[j];
        }
        // Backpropagate the error terms through the hidden layers.
        for (i = num_layers - 2; i >= 1; i--) {
          NumericVector this_helper = helper_delta_b[i];
          NumericVector next_helper = helper_delta_b[i + 1];
          List temp_list = network[i];
          NumericMatrix W = temp_list["weight"];
          NumericVector this_a = clone_vector(temp_list["a"]);
          for (j = 0; j < num_neurons[i]; j++) {
            this_helper[j] = 0.0;
            for (k = 0; k < num_neurons[i + 1]; k++) {
              this_helper[j] += next_helper[k] * W(k, j) ;
            }
            this_helper[j] *= this_a[j] * (1.0 - this_a[j]);
          }
        }
        // Accumulate the gradient contributions of this training example.
        for (i = num_layers - 1; i >= 1; i--) {
          temp_list = network[i - 1];
          NumericVector prev_a = clone_vector(temp_list["a"]);
          NumericVector this_helper = helper_delta_b[i];
          NumericVector this_delta_b = delta_b[i];
          NumericMatrix this_delta_W = delta_W[i - 1];
          for (j = 0; j < num_neurons[i]; j++) {
            for (k = 0; k < num_neurons[i - 1]; k++) {
              this_delta_W(j, k) += this_helper[j] * prev_a[k];
            }
            this_delta_b[j] += this_helper[j];
          }
        }
      }
      // Increment weights and biases:
      for (i = 0; i < num_layers - 1; i++) {
        List temp_list = network[i];
        NumericMatrix W = temp_list["weight"];
        List temp_list2 = network[i + 1];
        NumericVector b = temp_list2["bias"];
        NumericMatrix this_delta_W = delta_W[i];
        NumericVector this_delta_b = delta_b[i + 1];
        for (j = 0; j < num_neurons[i + 1]; j++) {
          for (k = 0; k < num_neurons[i]; k++) {
            // Gradient step plus L2 weight decay, eta*(lambda/n)*W:
            W(j, k) -= eta_batch*this_delta_W(j, k) + eta*lambda*W(j, k);
          }
          b[j] -= eta_batch*this_delta_b[j];
        }
      }
      train_ct = max_this_batch;
    } while (train_ct < num_training_images);
    this_test_scores = test_scores(test_set, test_outputs, network);
    this_training_scores = test_scores(training_set, training_outputs, network);
    // epoch, test correct, #test, test cost, training correct, #training,
    // training cost, (test cost - training cost):
    Rprintf("%d,%d,%d,%f,%d,%d,%f,%f\n",
            i_epoch + 1, (long)this_test_scores[0], num_test_images, this_test_scores[1],
            (long)this_training_scores[0], num_training_images, this_training_scores[1],
            this_test_scores[1] - this_training_scores[1]);
  }
  return network;
}
Results
The results from chapter 1 had about 95% accuracy. Rather than show heaps of correctly classified digits,
I extracted the first 200 that the old code had trouble with, defined as the true answer being called with less than 70% "probability". (Output
values of the network may not sum to 1; I normalise them, but I'm not sure if such values deserve to be called probabilities without scare quotes.)
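The selection was done with something like the following, using test_network() from the code above; old_network here is a stand-in for the trained chapter-1 network, and I don't have the exact script any more.

test_answers = test_network(images$test, old_network)
normalised = lapply(test_answers, function(x) { x / sum(x) })   # the pseudo-probabilities
prob_of_truth = mapply(function(p, d) { p[d + 1] }, normalised, digits$test)
tough_idx = head(which(prob_of_truth < 0.7), 200)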
Classifications for this tough set of 200 digits are shown below for six different setups. The first two use only the basic setup from
chapter 1, and didn't undergo much careful tuning of the learning rate beyond being in the range of what Nielsen uses at various times. I think
the first one had \( \eta = 0.3 \) for 15 or 20 epochs and then \( \eta = 0.1 \) for another 15 or 20; the second had 10 epochs each at \( \eta =
0.3,\, 0.1,\, 0.05 \). The four tests with the cross-entropy cost all followed the same training regime of 10 epochs at each of \( \eta = 0.5,\, 0.1,\, 0.02 \)
in turn.
Classification accuracies aren't perfectly reproducible since the weights and biases are initialised with random numbers, and I wouldn't trust
the decimal place too much. I also tried a hidden layer of 200 neurons, but while it nudged 98.2% in training, it ended at 98.1% like the 100-neuron
layer, and I don't include it below.
Sometimes a digit correctly classified by a worse-overall network becomes wrongly classified by the better-overall network.
Conclusions
It was quite satisfying to see the tests break through the psychological threshold of 98%. My recollection from the introduction is that there's
only one more percentage point of accuracy to be squeezed out over the rest of the book, which I think will happen with so-called "deep" networks, i.e.,
networks with more than one hidden layer. (The code I've written is functional on multi-layer networks already, but the results are not an
improvement on what's seen above; maybe a smidgen worse, 97.5% instead of 98.1%, but I haven't worked on it carefully.)
The case of the ten training images on a network with many more free parameters than training pixels fascinates me, and I hope to spend some time
poking and prodding at the behaviour as the training set becomes pathological.
Chapter 4 is titled "A visual proof that neural nets can compute any function", which doesn't look like it'll lend itself to a write-up. The next
post will probably be on a smaller topic – perhaps playing with tiny training sets, or non-neural methods for digit classification (I got
70% doing some slightly weird PCA-based thing). There was a gap of several weeks between the first post in this series and the second, but my day
job is starting to ease up, and subsequent posts shouldn't take so long to learn and write.