Home > Misc > Neural networks \( \newcommand{\bm}{\mathbf} \)

Neural networks and Deep Learning, Chapter 1


This post is the first in what I hope will be a series, as I work through Michael Nielsen's free online book Neural Networks and Deep Learning. Nielsen provides Python scripts to implement the networks he describes in the text. I originally thought that I'd use this as a way to learn Python, but through laziness I've instead decided to stick with R and Rcpp; my rationalisation is that if I code everything from scratch, then I will be forced to learn the algorithms better than if I just stare at Nielsen's code.

I don't know if R and Rcpp is a sensible strategy in the long run – R doesn't always cope well with even moderately large datasets, and I expect to encounter these in machine learning problems – but for the MNIST database of handwritten digits, which is the main dataset used in the book and which contains 60,000 28x28-pixel digits, everything seems to go OK.

The book feels to me (so far!) like it should be accessible in all its mathematical details to a 3rd or 4th year undergrad in a mathsy course. I was able to derive most of chapter 2 (albeit implementing it inefficiently when I first coded it up) before reading it, and it felt like a maths assignment at roughly that level. Since you can get the full details and explanations from Nielsen, I'll be pretty sparse with the theory in these posts.

Some results of my first working neural network and some non-working ones are in the results section below.


A neural network consists of a series of layers, each layer containing some number of neurons. The first layer is the input; for the digit-recognition problem, there will be one input neuron for each pixel, so \( 28^2 = 784 \) neurons. The output layer is usually set by what we want to have the network do – in this case, there will be 10 output neurons, one for each digit between 0 and 9. We would like an output neuron to fire if and only if the input image shows the corresponding digit.

In between the input and output layers is some number of so-called hidden layers. For this chapter there will only be one hidden layer; apparently it's a lot harder to train a network with more than one hidden layer, so we'll be starting simple. Given the large number of input neurons and small number of output neurons, I expect that the layers are monotone decreasing in their numbers of neurons, but I don't know for sure. (In the recommended setup for this chapter, there are 30 neurons in the hidden layer, in between the 784 inputs and 10 outputs.)

Each neuron in the input layer has some value, read from the current training image; let \( \bm{a}^0 \) be the vector of these values. (I'll use superscript indices to denote the relevant layer in the network; be careful if you're working through these notes closely though, since sometimes I deviate from Nielsen's indexing conventions. He probably uses standard conventions and mine should be considered anarchic.)

The "activation" value of each neuron in the next layer (i.e., the hidden layer), \( a^1_j \), is a linear combination of the values in the first layer, plus the neuron's "bias" \( b^1_j \), transformed to the interval \( (0, 1) \) by a sigmoid function, \( \sigma(z) = 1 / (1 + \exp(-z)) \). Allowing \( \sigma \) to work element-wise through a vector, and letting \( W^0 \) be an \( N_1 \times N_0 \) matrix (where \( N_i \) denotes the number of neurons in layer \( i \)), the vector of activation values of the hidden layer's neurons is

\begin{equation*} \bm{a}^1 = \sigma(W^0 \bm{a}^0 + \bm{b}^1). \end{equation*}

Subsequent layers of neurons are defined similarly. The goal of training the network is to set all the weights \( W^i_{jk} \), and all of the biases \( b^i_j \), so that the activation values of the output neurons agree with the true answer whenever a new image is input (for this chapter, all of these values are initialised as random standard Gaussian numbers). To this end, 50,000 of the MNIST images are used as a training set, and 10,000 as a test set (written by a different set of people to those in the training set, so that the testing is properly out of sample).

In a short while, it will be useful to have defined

\begin{equation*} \bm{z}^1 = W^0 \bm{a}^0 + \bm{b}^1, \end{equation*}

so that, in general, \( \bm{a}^i = \sigma(\bm{z}^i) \).

The training method is surprisingly simple conceptually, at least for this introductory chapter, and the details of its derivation are also not particularly difficult – it involves repeated application of the chain rule with partial derivatives, and the biggest troubles come from getting lost in the indices (and regular checks to make sure that you remembered which way round the dimensions of the \( W^i \) matrices are).

Let \( L \) be the (superscript) index of the output layer. For a given training image, let \( \bm{y}^L \) be the desired outputs; this will be a vector with a 1 in the position of the handwritten digit and zeros elsewhere. I've written \( \bm{y}^L \) with a superscript index, to remind us that it applies to the output layer, but we won't have \( \bm{y} \)'s anywhere else. The actual output of the current network is \( \bm{a}^L \). Define the cost function \( C \) (a function of all the weights and biases) by

\begin{align} \nonumber C &= \frac{1}{2} || \bm{a}^L - \bm{y}^L ||^2 \\ \label{eqn_cost} &= \frac{1}{2} \sum_j (a^L_j - y^L_j)^2. \end{align}

I have sneakily omitted writing the start and end of the sum over \( j \) in the latter sum, mostly so that I leave worries about zero-indexing conventions to the code. It is a sum over the output neurons, running either from 1 to \( N_L \) or from 0 to \( (N_L - 1) \).

The technique now is to use gradient descent, where the gradient \( \nabla \) has a component for every element of every weight matrix \( W^i_{jk} \) and every bias \( b^i_j \). The gradient \( \nabla C \) is a vector pointing in the direction of greatest increase of the scalar field \( C \); since the goal is to minimise the cost function, the procedure is to increment all weights and biases in the direction opposite \( \nabla C \).

Consider one of the components of the gradient, \( \partial C / \partial b^L_j \) (it is easiest to start with the output layer). Equation \eqref{eqn_cost} is written only in terms of the output activations \( a^L_j \), so apply the chain rule:

\begin{equation*} \frac{\partial C}{\partial b^L_j} = \sum_k \frac{\partial C}{\partial a^L_k}\frac{\partial a^L_k}{\partial b^L_j}. \end{equation*}

The partial derivative \( \partial C / \partial a^L_k \) is just \( a^L_k - y^L_k \). The key then to computing the gradient of the cost function is to find the partial derivatives of the output activations with respect to each of the biases and weights. For the output layer, this is straightforward, since

\begin{align*} \frac{\partial a^L_k}{\partial b^L_j} &= \frac{\partial}{\partial b^L_j} \sigma((W^{L-1}\bm{a}^{L-1})_k + b^L_k) \\ &= \frac{d\sigma(z^L_k)}{dz^L_k} \frac{\partial z^L_k}{\partial b^L_j} \\ &= \frac{d\sigma(z^L_k)}{dz^L_k} \delta_{jk}. \end{align*}

To save space, I'll write \( d\sigma(z^i_k) / dz^i_k \) as \( \sigma'^i_k \). This quantity is easy to compute, since the current \( \bm{z}^i \) values are known, and \( \sigma'(z) = \exp(-z) / (1 + \exp(-z))^2 \), which can also be written as \( \sigma(z)(1 - \sigma(z)) \).

Working backwards through the layers, the procedure will still be to apply the chain rule, and the result will be a growing collection of \( W \)'s and \( \sigma' \)'s. The fortunate aspect is that the derivatives of the earlier layer's variables can be expressed in terms of the derivatives of the subsequent layer, with only one new \( W \) and one new \( \sigma' \). Summing over repeated indices that don't appear on the LHS (!):

\begin{align*} \frac{\partial a^L_k}{\partial b^{L-1}_j} &= \frac{\partial}{\partial b^{L-1}_j} \sigma((W^{L-1}\sigma(W^{L-2}\bm{a}^{L-2} + \bm{b}^{L-1}) + \bm{b}^L)_k) \\ &= \sigma'^L_k W^{L-1}_{kl} \frac{\partial}{\partial b^{L-1}_j} \sigma(W^{L-2}_{lm}a^{L-2}_m + b^{L-1}_l) \\ &= \sigma'^L_k W^{L-1}_{kl} \sigma'^{L-1}_l \delta_{jl} \\ &= \sigma'^L_k \sigma'^{L-1}_j W^{L-1}_{kj}. \end{align*}

The \( \sigma'^L_k \) at the end there is the same as for the derivative for the output layer; any derivatives for still earlier layers (not that we have any for this chapter!) end up with sums over repeated indices at the end, but it all stays as matrix multiplication and element-wise vector multiplication ("Hadamard products", Nielsen tells us). The partial derivative with respect to a weight matrix element \( W^i_{jk} \) is identical to that for the bias \( b^{i+1}_j \) but for an extra factor of \( a^i_k \). (That last subscript \( k \) is not the same as the \( k \) above in \( \partial a^L_k \) etc.)

Read Chapter 2 for a full treatment and presentation of this process (or get out a pen and paper), called "back-propagation" because of the way you start with the output layer and then compute the derivatives backwards through the layers.

The cost function is notionally defined over the whole training set, so all of the above work should sit inside a loop over all training images. But the gradient over the whole training set is approximated quite well by a small subset, so the big loop is usually divided into small "batches", with the weights and biases incremented in the direction opposite the approximate gradient at the end of each batch. Once all batches in a training set have been trained on, that is called an epoch of training, and then you probably go back to the start and train for some more epochs. When using the batches ("stochastic gradient descent"), the order of the training images is shuffled at the start of each epoch.

There is nothing preventing the algorithm descending gradients to a local rather than global minimum with this procedure, and I expect some sort of stochastic step in future chapters to overcome this limitation.

Data and import

The MNIST page separates the handwritten digits into a training set of 60,000 images and a test set of 10,000. Nielsen makes a further separation of the 60,000 into 50,000 training images and 10,000 validation images, to be left unused until a later chapter. He provides a ready-made dataset in Python pickle format (which I could only load into Python 2.7, not 3.4, because of an encoding issue – something to watch out for if you know as little Python as I do and are trying to follow his procedures). Since I wrote my code in R/Rcpp, I wrote some boring code to work from MNIST directly and re-create Nielsen's dataset for R; it also uses the EBImage package to write the images to file.

Here is what the first handful of test images look like, roughly (this is a screenshot from Ubuntu's file manager):


I store the neural network as an R list, which is a convenient data structure that Rcpp handles after a fashion. Here is the R code, which calls some Rcpp functions that I'll show afterwards:



new_network = function(n) {
  # Initialises a neural network.
  # n is a vector containing the number of neurons in each layer.
  num_layers = length(n)
  if (num_layers < 2) {
    stop("Need longer input to new_network()")
  }
  
  network = list()
  length(network) = num_layers
  
  for (i in seq_along(n)) {
    this_n = n[i]
    
    # Parameter that I might play with sometimes:
    k = 1
    
    network[[i]] = list()
    network[[i]]$a = 0*rnorm(this_n)
    network[[i]]$bias = k*rnorm(this_n)
    
    if (i < num_layers) {
      next_n = n[i+1]
      network[[i]]$weight = matrix(k*rnorm(this_n*next_n), nrow=next_n, ncol=this_n)
    }
  }
  
  return(network)
}

start_network = new_network(c(784, 30, 10))

out_network = train_network(images$training, vec_digits$training, start_network, images$test, vec_digits$test, 3, 10, 20)

test_answers = test_network(images$test, out_network)
test_answers_vec = sapply(test_answers, function(x) { which(x == max(x)) - 1 })
true_answers = digits$test

qplot(true_answers, test_answers_vec, alpha=.1, position=position_jitter()) + guides(alpha=FALSE)
print(length(which(true_answers == test_answers_vec)))

The bulk of the code is in Rcpp. I wrote almost everything in element-by-element for loops, though I suspect you could vectorise some of it, perhaps with RcppArmadillo. Editing the contents of a List structure is a little bit painful: you make a shallow copy of the part of the list you want to edit, and then fill that shallow copy in element by element. I couldn't get it to work by directly setting the shallow copy equal to, e.g., some other NumericVector. But I'm a very, very long way from an expert on Rcpp.

The important routines are feed_forward(), test_network(), and train_network(). Various parameters are passed to train_network(), including the learning rate \( \eta \), which defines how far to increment all variables in the direction opposite the gradient of the cost function, the size of the batches for the approximate gradient calculations, and the number of epochs to train for. As well as training data, the test data is also passed to train_network(), so that the progress of the training can be tracked epoch by epoch.

The fiddliness with the List structures and my use of for loops rather than vectorised or matrix calculations means that my code is quite sprawling compared to Nielsen's Python. My code is also about 20% slower, which probably says something about the relative abilities of me and the people who made numpy.

#include <Rcpp.h>
using namespace Rcpp;

double sigmoid(double z) {
    return 1.0 / (1.0 + exp(-z));
}

NumericVector clone_vector(NumericVector input) {
  long N = input.size();
  NumericVector v(N);
  for (long i = 0; i < N; i++) {
    v[i] = input[i];
  }
  return v;
}

NumericMatrix clone_matrix(NumericMatrix input) {
  long N1 = input.nrow();
  long N2 = input.ncol();
  NumericMatrix M(N1, N2);
  for (long i = 0; i < N1; i++) {
    for (long j = 0; j < N2; j++) {
      M(i, j) = input(i, j);
    }
  }
  return M;
}

// http://gallery.rcpp.org/articles/stl-random-shuffle/
// wrapper around R's RNG such that we get a uniform distribution over
// [0,n) as required by the STL algorithm
inline int rand_wrapper(const int n) { return floor(unif_rand()*n); }

NumericVector shuffle(NumericVector a) {
  // clone a into b to leave a alone
  NumericVector b = clone(a);
  std::random_shuffle(b.begin(), b.end(), rand_wrapper);
  return b;
}

List feed_forward(NumericVector input, List network) {
  int num_layers = network.size();
  List this_list, prev_list;
  NumericVector a, prev_a, b;
  long i, j, k;
  double sum_z;
  for (i = 0; i < num_layers; i++) {
    this_list = network[i];
    a = this_list["a"];
    if (i == 0) {
      for (j = 0; j < input.size(); j++) {
        a[j] = input[j];
      }
    } else {
      prev_list = network[i - 1];
      prev_a = prev_list["a"];
      b = this_list["bias"];
      NumericMatrix W = prev_list["weight"];
      for (j = 0; j < a.size(); j++) {
        sum_z = 0;
        for (k = 0; k < prev_a.size(); k++) {
          sum_z += W(j, k) * prev_a[k];
        }
        sum_z += b[j];
        a[j] = sigmoid(sum_z);
      }
    }
  }
  return network;
}

long max_i(NumericVector v) {
  long N = v.size();
  double max_v = max(v);
  for (long i = 0; i < N; i++) {
    if (v[i] == max_v) {
      return i;
    }
  }
  return -1;
}

// [[Rcpp::export]]
List test_network(List test_images, List network) {
  long N = test_images.size();
  long num_layers = network.size();
  List answers(N);
  List temp_list;
  temp_list = network[num_layers - 1];
  NumericVector a_out = clone_vector(temp_list["a"]);
  long num_out = a_out.size();
  for (long i = 0; i < N; i++) {
    NumericVector this_test = test_images[i];
    network = feed_forward(this_test, network);
    answers[i] = NumericVector(num_out);
    NumericVector v = answers[i];
    temp_list = network[num_layers - 1];
    NumericVector temp_v = clone_vector(temp_list["a"]);
    for (long j = 0; j < num_out; j++) {
      v[j] = temp_v[j];
    }
  }
  return answers;
}

// [[Rcpp::export]]
List train_network(List training_set, List training_outputs, List network, List test_set, List test_outputs, double eta, long batch_size, long num_epochs) {
  long i, j, k, train_ct, i_train, max_this_batch, i_epoch, i_layer;
  double this_deriv_part, this_deriv_term;
  NumericVector next_a_term, this_a_term;
  NumericMatrix aux_W;
  List aux_list;
  int num_layers = network.size();
  long num_training_images = training_set.size();
  long num_test_images = test_set.size();
  long num_test_correct;
  long num_neurons[num_layers];
  List delta_W(num_layers - 1);
  List delta_b(num_layers);
  List helper_delta_b(num_layers);
  List temp_list;
  for(i = 0; i < num_layers; i++) {
    // Rcpp doesn't seem to like combining the following three lines:
    temp_list = network[i];
    NumericVector this_layer = temp_list["a"];
    num_neurons[i] = this_layer.size();
    delta_b[i] = NumericVector(num_neurons[i]);
    if (i > 0) {
      delta_W[i - 1] = NumericMatrix(num_neurons[i], num_neurons[i - 1]);
    }
  }
  NumericVector order_images(num_training_images);
  for (i = 0; i < num_training_images; i++) {
    order_images[i] = i;
  }
  for (i_epoch = 0; i_epoch < num_epochs; i_epoch++) {
    if (i_epoch % 1 == 0) {
      Rprintf("Starting epoch %d\n", i_epoch);
    }
    order_images = shuffle(order_images);
    train_ct = 0;

    do {
      // Re-initialise the increment lists:
      for (i = 0; i < num_layers; i++) {
        delta_b[i] = NumericVector(num_neurons[i]);
        if (i > 0) {
          delta_W[i - 1] = NumericMatrix(num_neurons[i], num_neurons[i - 1]);
        }
      }
      max_this_batch = train_ct + batch_size;
      if (max_this_batch > num_training_images) {
        max_this_batch = num_training_images;
      }
      for (i_train = train_ct; i_train < max_this_batch; i_train++) {
        NumericVector this_training = training_set[(long) order_images[i_train]];
        network = feed_forward(this_training, network);
        NumericVector this_training_output = training_outputs[(long) order_images[i_train]];
        for (i = 0; i < num_layers; i++) {
          helper_delta_b[i] = NumericVector(num_neurons[i]);
        }
        List temp_list = network[num_layers - 1];
        NumericVector output_a = clone_vector(temp_list["a"]);
        NumericVector output_d_sig = clone_vector(temp_list["a"]);
        // Start with the output layer.
        NumericVector this_helper = helper_delta_b[num_layers - 1];
        for (j = 0; j < num_neurons[num_layers - 1]; j++) {
          output_d_sig[j] = output_d_sig[j] * (1.0 - output_d_sig[j]);
          this_helper[j] = (output_a[j] - this_training_output[j]) * output_d_sig[j];
        }
        // Back-propagate:
        for (i = num_layers - 2; i >= 1; i--) {
          NumericVector this_helper = helper_delta_b[i];
          NumericVector next_helper = helper_delta_b[i + 1];
          List temp_list = network[i];
          NumericMatrix W = temp_list["weight"];
          NumericVector this_a = clone_vector(temp_list["a"]);
          for (j = 0; j < num_neurons[i]; j++) {
            this_helper[j] = 0.0;
            for (k = 0; k < num_neurons[i + 1]; k++) {
              this_helper[j] += next_helper[k] * W(k, j);
            }
            this_helper[j] *= this_a[j] * (1.0 - this_a[j]);
          }
        }
        for (i = num_layers - 1; i >= 1; i--) {
          temp_list = network[i - 1];
          NumericVector prev_a = clone_vector(temp_list["a"]);
          NumericVector this_helper = helper_delta_b[i];
          NumericVector this_delta_b = delta_b[i];
          NumericMatrix this_delta_W = delta_W[i - 1];
          for (j = 0; j < num_neurons[i]; j++) {
            for (k = 0; k < num_neurons[i - 1]; k++) {
              this_delta_W(j, k) += this_helper[j] * prev_a[k];
            }
            this_delta_b[j] += this_helper[j];
          }
        }
      }
      // Increment weights and biases:
      for (i = 0; i < num_layers - 1; i++) {
        List temp_list = network[i];
        NumericMatrix W = temp_list["weight"];
        List temp_list2 = network[i + 1];
        NumericVector b = temp_list2["bias"];
        NumericMatrix this_delta_W = delta_W[i];
        NumericVector this_delta_b = delta_b[i + 1];
        for (j = 0; j < num_neurons[i + 1]; j++) {
          for (k = 0; k < num_neurons[i]; k++) {
            W(j, k) -= eta*this_delta_W(j, k);
          }
          b[j] -= eta*this_delta_b[j];
        }
      }
      train_ct = max_this_batch;
    } while (train_ct < num_training_images - 1);
    // See how we're doing.
    List test_answers = test_network(test_set, network);
    num_test_correct = 0;
    for (i = 0; i < num_test_images; i++) {
      NumericVector this_ans = test_answers[i];
      NumericVector this_true = test_outputs[i];
      if (max_i(this_ans) == max_i(this_true)) {
        num_test_correct++;
      }
    }
    Rprintf("Score of %d out of %d\n", num_test_correct, num_test_images);
  }
  return network;
}


The results for most of last week were weird: where Nielsen said that the network should be reaching 95% accuracy, I was seeing, at best, 60%. How do you get so much better than chance but so much worse than expected? After combing through everything line by line (and finding one inconsequential bug), and re-writing the back-propagation algorithm after having read chapter 2, I eventually tried to get the Python code running and compared the intermediate results after a single iteration.

And in trying to ensure that all inputs were identical between R code and Python code, I eventually found the cause of my week's frustration. The input images are scaled so that the background is black (pixel value 0) and the pen strokes are white (maximum pixel value 255, then re-scaled by me to a maximum of 1). I inverted this scale so that the images had a white background.

For reasons that are totally unclear to me, my convention on the pixel values caused the network to get stuck in bizarre-looking local optima: maybe four or five different digits would get correctly classified most of the time, but the rest would all end up as 8's, for example. I never got much above 60% accuracy, and frequently would be down at 20%-30%.

I have no idea why a framework that's powerful enough to recognise handwritten digits gets stuck if the digits are written with dark ink on white paper instead of white ink on dark paper.

Nevertheless, once I did type "1.0 - " in a couple of places, everything ran as it was supposed to, with accuracy around 94% to 95% after a few training epochs. Here are the first 200 test images, with the classifications of the network (the output neuron activations don't necessarily sum to 1; I've normalised them so that I can interpret the results as "probabilities"). There are only 6 wrongly classified digits in this set, so it is a little bit better than average.

The errors are interesting to look at. The first is a 3 that is very close to being an 8. A 9 drawn with the vertical stroke at a steep angle was read as a 7. It had great trouble with the 2 that looks like a random squiggle. And so on. Some of the errors make sense, though some are more mysterious. As I progress through the book and learn to improve the training of the network(s), I'll re-do this results table and watch how the networks learn to cope with the irregularities currently causing it some problems.

It's interesting to test the method without a hidden layer of neurons (this is an exercise suggested in the book, and a very natural one to try). With the other parameters unchanged, three different initialisations gave accuracies of 84%, 84%, and 66% (it took between 6 and 30 epochs to reach roughly those values). Interestingly, in all three runs it failed miserably at classifying 8's. Of course I have no idea why 8's in particular should cause problems for this system, and perhaps I never will know – my hope is that the sort of techniques used for Google's DeepDream might shed some light on issues like this, but at this early stage of my learning, I have no idea whether they can be applied or not.


After spending most of my spare time this week chasing down the bug caused by the neural network's bizarre inability to handle input images with white backgrounds, I'm happy to have got it working. I expect the next couple of chapters will be incremental improvements – working out how to initialise the weights and biases, how to set the learning rate parameter, introducing some random element to the gradient descent to avoid local minima, etc.

I don't think I've fully internalised how unusual training a neural network is. The network I used has almost 24,000 free parameters, all being sent through at least one sigmoid function, and I haven't the slightest idea of what features it's "looking for" when trying to classify digits. There's an obvious parallel to not having the slightest idea how human intelligence works. On the other hand, a part of me just thinks, "Eh, you've got 24,000 free parameters to try to put 784-pixel images into one of ten categories – that sounds like it ought to work, and its success is unremarkable."

I should probably have a try at some non-neural attempts at the same problem. (Kaggle has a simple random forest R script; trying to run it on the full training set uses more memory than my computer has, but with 10,000 training images it gets to about 58% accuracy. Interestingly, the scatter plot of true versus predicted digits showed that the random forest was more likely to make small numerical errors than to confuse similar-looking shapes. When I get some time and motivation for this, I'll have to try to understand what's going on, since at the moment I'm just treating it as a black box.) I don't expect that something I coded without learning any other machine learning techniques would be a strong benchmark, but it's an exercise I'll hopefully get around to before year's end.

Chapter 2 of Nielsen's book is on the derivation of the back-propagation algorithm, so my next post will either be on chapter 3's techniques, or me playing with toy models.

Posted 2015-10-17, updated 2015-10-25 (minor code edit).
