Non-neural MNIST benchmarks
Introduction
I've mentioned a couple of times that I'd like to try some non-neural approaches to classifying the MNIST digits, so that I can have a better
idea of how good the neural networks (and other machine learning algorithms, if I ever work with them) really are. This is just me – I'm not
pretending that my results here are world-class classification benchmarks, they're just what I might have come up with if I'd been given the problem
to classify the MNIST digits and didn't know where to start.
Having read through to chapter 3 of Michael Nielsen's book, I've got my best neural network
up to 98.1% accuracy on MNIST. Earlier, in chapter 1, Nielsen had written:
> It's not difficult to find other ideas which achieve accuracies in the 20 to 50 percent range. If you work a bit harder you can get up
> over 50 percent. But to get much higher accuracies it helps to use established machine learning algorithms.
I think that either this is underselling how easy the MNIST dataset is to classify, or "machine learning algorithms" include a broader array of
pretty simple techniques than I'd thought (the Wikipedia machine learning template includes "linear regression", so maybe it's the latter). Dotting
the test image with the means of each digit's training images gets me to 82%. With some more careful thought about PCA, I get 94%. With
k-nearest-neighbours (which does look like an "official" machine learning algorithm, albeit a very simple one) I get over 97%. So, at least on MNIST,
the advantage of neural networks for classification lies in squeezing as much as possible out of the data, rather than in being
an enormous amount better than simple alternatives.
Comparing test images to digit means
This is a very simple algorithm:
- Treat each 28 × 28 image as a 784-length vector.
- For each digit 0-9, calculate the mean vector from the training set.
- For each test image, find the dot product with each of the training means.
- Return the digit with the highest dot product.
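In symbols: writing \( x \) for a test image and \( \mu^d \) for the mean of digit \( d \)'s training images, the returned digit is the
\( d \in \{0, \ldots, 9\} \) maximising
\begin{equation*}
\frac{x \cdot \mu^d}{\lVert x \rVert \, \lVert \mu^d \rVert}.
\end{equation*}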
As the formula shows, the vectors should be normalised, so as not to favour digits with more light pixels (forgetting to do this left me at
about 60% accuracy, which made me think that my then-weird PCA algorithm's 70% was good). On the test set this method gets 8236/10000 correct.
Jittered scatterplot of test answers against true answers:
[Figure: jittered scatterplot of classified digits against true digits]
R code:
library(ggplot2)
load("../digit_data/r_images_nielsen.rda")

digit_image_matrices = list()
mean_digits = list()
length(digit_image_matrices) = 10
length(mean_digits) = 10

# Training
for (i in 0:9) {
    this_i = which(digits$training == i)
    num_digits = length(this_i)
    digit_image_matrices[[i + 1]] = matrix(0, nrow=num_digits, ncol=784)
    for (j in 1:num_digits) {
        digit_image_matrices[[i + 1]][j, ] = images$training[[this_i[j]]]
    }
    mean_digits[[i + 1]] = colMeans(digit_image_matrices[[i + 1]])
    this_norm = sqrt(sum(mean_digits[[i + 1]] * mean_digits[[i + 1]]))
    mean_digits[[i + 1]] = mean_digits[[i + 1]] / this_norm
}
# Testing
test_answers = list()
num_test_images = length(digits$test)
length(test_answers) = num_test_images
test_answers_vec = numeric(num_test_images)
for (i in 1:num_test_images) {
    # Normalise the test image once, then dot it with each digit's mean.
    this_img = images$test[[i]]
    this_img = this_img / sqrt(sum(this_img * this_img))
    test_answers[[i]] = numeric(10)
    for (j in 1:10) {
        test_answers[[i]][j] = sum(mean_digits[[j]] * this_img)
    }
    test_answers_vec[i] = which.max(test_answers[[i]]) - 1
}
true_answers = digits$test
qplot(true_answers, test_answers_vec, alpha=.1, position=position_jitter()) + guides(alpha=FALSE)
print(length(which(test_answers_vec == true_answers)))
PCA
My first attempt at a PCA-based classification method felt wrong but got me to what I then thought was a respectable 70%. After working through
the theory of PCA carefully (see my notes on the subject), I came up with a better way of using principal components to
classify digits, with accuracy over 94%.
The logic behind the method goes like this. Consider all the '0' digit images, treat them as vectors of length 784, and put them in a big
matrix with 784 columns. Calculate the first few principal components of this data matrix – (too?) loosely speaking, these should correspond to
the main ways in which the digit is drawn. Repeat for the other digits, so that there is now a collection of each digit's first few
principal components.
Given a test image, the algorithm goes:
- For each digit 0-9:
  - Project the test image onto the digit's first few factors.
  - Back-transform from these factors to a 784-length vector approximating the original image.
  - Calculate a score based on how much the back-transformed image differs from the original.
- The digit that scored the best is the returned classification.
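In symbols, matching the code below: if \( \mu^d \) is the mean of digit \( d \)'s training images and \( V^d \) is the \( 784 \times m \)
matrix whose columns are digit \( d \)'s first \( m \) principal components, then the score for a test image \( x \) is the size (sum of
pixelwise absolute values) of the reconstruction residual,
\begin{equation*}
s^d = \left\lVert (x - \mu^d) - V^d (V^d)^T (x - \mu^d) \right\rVert_1,
\end{equation*}
and the digit with the smallest \( s^d \) is returned.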
The question is how many principal components to keep. Using all 784 is obviously wrong, since every digit's full set of factors will perfectly
transform and back-transform any test image to itself. Using only the first factor will miss lots of variations on the digit (for example, if it's
shifted a little from where the first factor says it should be). After a bit of playing around, I found that about 30 factors gave near-optimal
results, with 9462 out of the 10000 test images classified correctly. Jittered scatterplot:
[Figure: jittered scatterplot of classified digits against true digits]
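The scan over the number of factors is easy to automate. Here's a sketch, with a hypothetical helper function, assuming the digit_pca, digit_means, test_image_matrix and num_test objects built in the script below:

# Hypothetical helper: classify all test images keeping m factors per digit.
classify_with_m_factors = function(m) {
    scores = matrix(0, nrow=num_test, ncol=10)
    for (i in 0:9) {
        # Centre the test images on this digit's training mean.
        centred = sweep(test_image_matrix, 2, digit_means[[i + 1]])
        rot = digit_pca[[i + 1]]$pca$rotation[ , 1:m]
        # Project onto the first m factors, then back-transform.
        back = (centred %*% rot) %*% t(rot)
        scores[ , i + 1] = rowSums(abs(centred - back))
    }
    apply(scores, 1, which.min) - 1
}
for (m in c(10, 20, 30, 50)) {
    answers = classify_with_m_factors(m)
    print(sprintf("%d factors: %d / %d correct", m, sum(answers == digits$test), num_test))
}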
R code (sorry for the inconsistency between lists and matrices):
library(ggplot2)
load("../digit_data/r_images_nielsen.rda")

digit_image_matrices = list()
digit_pca = list()
digit_means = list()
length(digit_image_matrices) = 10
length(digit_pca) = 10
length(digit_means) = 10

# Training
for (i in 0:9) {
    print(i)
    this_i = which(digits$training == i)
    num_digits = length(this_i)
    digit_image_matrices[[i + 1]] = matrix(0, nrow=num_digits, ncol=784)
    for (j in 1:num_digits) {
        digit_image_matrices[[i + 1]][j, ] = images$training[[this_i[j]]]
    }
    digit_pca[[i + 1]]$pca = prcomp(digit_image_matrices[[i + 1]], center=TRUE)
}
for (i in 0:9) {
    # Calculate means by digit for use in centring the test images.
    digit_means[[i + 1]] = colMeans(digit_image_matrices[[i + 1]])
}
# Testing
# Number of principal components to use:
num_evecs = 30
num_test = length(digits$test)
test_image_matrix = matrix(0, nrow=num_test, ncol=784)
for (j in 1:num_test) {
    test_image_matrix[j, ] = images$test[[j]]
}
test_answers = matrix(0, nrow=num_test, ncol=10)
test_answers_vec = numeric(num_test)
for (i in 0:9) {
    print(i)
    this_centred_matrix = test_image_matrix
    for (j in 1:num_test) {
        this_centred_matrix[j, ] = this_centred_matrix[j, ] - digit_means[[i + 1]]
    }
    # Convert the images to the first num_evecs factors, then convert back and
    # see how good the reproduction of the original image is.
    reduced_images = t(digit_pca[[i + 1]]$pca$rotation[ , 1:num_evecs]) %*% t(this_centred_matrix)
    back_transf_images = t(digit_pca[[i + 1]]$pca$rotation[ , 1:num_evecs] %*% reduced_images)
    comparison = abs(this_centred_matrix - back_transf_images)
    test_answers[ , i + 1] = rowSums(comparison)
}
# which.min avoids returning multiple indices when there are ties.
test_answers_vec = apply(test_answers, 1, which.min) - 1
true_answers = digits$test[1:num_test]
qplot(true_answers, test_answers_vec, alpha=.1, position=position_jitter()) + guides(alpha=FALSE)
print(length(which(test_answers_vec == true_answers)))
k-nearest neighbours
(I didn't know about this algorithm ahead of time, but I saw it in the Wikipedia article
on MNIST, and it looked pretty simple.)
Consider each training image \( T^i \) as a vector of length 784, with \( j \)-th entry \( T^i_j \). Define the distance from \( T^i \) to a test image
\( T \) as
\begin{equation*}
d^i = \sqrt{\sum_{j=1}^{784} (T^i_j - T_j)^2}.
\end{equation*}
The nearest neighbour to \( T \) across the training images is simply the image \( T^i \) with the smallest distance \( d^i \). The simplest
implementation of k-nearest neighbours is "1-nearest neighbour", which (for MNIST) returns the digit of this \( T^i \); more generally, for
\( k > 1 \), the classification is the most common digit among the \( k \) nearest neighbours in the training set.
The class R package implements k-NN and works pretty much straight out of the box; it
classified the digits so well that I thought I might have made a mistake, accidentally feeding the true answers into the system somewhere. I wrote my
own implementation of k-NN just to make sure it was all as simple and effective as it seemed, and my Rcpp code follows shortly.
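For reference, using the packaged version takes only a few lines. Here's a minimal sketch, assuming the 30-dimensional reduced_training_matrix and reduced_test_matrix objects built in the full script at the end of this section (knn() returns a factor, so it needs converting back to integers for scoring):

library(class)
# Sketch only: reduced_training_matrix and reduced_test_matrix are the
# PCA-reduced matrices built in the script below.
knn_classifs = knn(train = reduced_training_matrix,
                   test = reduced_test_matrix,
                   cl = factor(digits$training),
                   k = 3)
knn_answers_vec = as.integer(as.character(knn_classifs))
print(length(which(knn_answers_vec == digits$test)))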
Wikipedia says that k-NN can struggle with high-dimensional data sets,
but the 784 dimensions of MNIST posed no problems: 1-nearest-neighbour got 96.7% of the test images correct. Nevertheless, the algorithm is
both much faster and slightly more accurate when working in reduced dimensions; using the first 30 principal components and \( k = 3 \) brings the score
up to 9738/10000. With my code, it took about 25 minutes to work through the classifications with 784-length vectors, and about half a minute with
30-length vectors. Jittered scatterplot:
[Figure: jittered scatterplot of classified digits against true digits]
(Unlike in the previous section, the principal components are calculated over the full training dataset, and not separately by digit.)
As in a previous post, I use Romain François's code to
return sorted indices for me. The main function to be called from R is k_nearest_classif(), and it is less than 40 lines long. Rcpp:
#include <Rcpp.h>
#include <queue>
using namespace Rcpp;
// http://gallery.rcpp.org/articles/top-elements-from-vectors-using-priority-queue/
template <int RTYPE>
class IndexComparator {
public:
    typedef typename Rcpp::traits::storage_type<RTYPE>::type STORAGE ;

    IndexComparator( const Vector<RTYPE>& data_ ) : data(data_.begin()){}

    inline bool operator()(int i, int j) const {
        return data[i] > data[j] || (data[i] == data[j] && j > i ) ;
    }

private:
    const STORAGE* data ;
} ;

template <>
class IndexComparator<STRSXP> {
public:
    IndexComparator( const CharacterVector& data_ ) : data(data_.begin()){}

    inline bool operator()(int i, int j) const {
        return (String)data[i] > (String)data[j] || (data[i] == data[j] && j > i );
    }

private:
    const SEXP* data ;
} ;

template <int RTYPE>
class IndexQueue {
public:
    typedef std::priority_queue<int, std::vector<int>, IndexComparator<RTYPE> > Queue ;

    IndexQueue( const Vector<RTYPE>& data_ ) : comparator(data_), q(comparator), data(data_) {}

    inline operator IntegerVector(){
        int n = q.size() ;
        IntegerVector res(n) ;
        for( int i=0; i<n; i++){
            // +1 for 1-based R indexing [deleted -- DB.]
            res[i] = q.top();
            q.pop() ;
        }
        return res ;
    }
    inline void input( int i){
        // if( data[ q.top() ] < data[i] ){
        if( comparator(i, q.top() ) ){
            q.pop();
            q.push(i) ;
        }
    }
    inline void pop(){ q.pop() ; }
    inline void push( int i){ q.push(i) ; }

private:
    IndexComparator<RTYPE> comparator ;
    Queue q ;
    const Vector<RTYPE>& data ;
} ;

template <int RTYPE>
IntegerVector top_index(Vector<RTYPE> v, int n){
    int size = v.size() ;

    // not interesting case. Less data than n
    if( size < n){
        return seq( 0, n-1 ) ;
    }

    IndexQueue<RTYPE> q( v ) ;
    for( int i=0; i<n; i++) q.push(i) ;
    for( int i=n; i<size; i++) q.input(i) ;
    return q ;
}

// [[Rcpp::export]]
IntegerVector top_index( SEXP x, int n){
    switch( TYPEOF(x) ){
    case INTSXP: return top_index<INTSXP>( x, n ) ;
    case REALSXP: return top_index<REALSXP>( x, n ) ;
    case STRSXP: return top_index<STRSXP>( x, n ) ;
    default: stop("type not handled") ;
    }
    return IntegerVector() ; // not used
}
// [[Rcpp::export]]
IntegerVector k_nearest_classif(NumericMatrix training, IntegerVector training_outputs, NumericMatrix test,
                                int k = 1) {
    long i, j, ct;
    long num_training = training_outputs.size();
    long num_test = test.nrow();
    long num_pixels = test.ncol();
    NumericVector distances(num_training);
    long min_class = min(training_outputs);
    long max_class = max(training_outputs);
    long num_classes = max_class - min_class + 1;
    IntegerVector nearest(k);
    IntegerVector classifs(num_test);

    for (i = 0; i < num_test; i++) {
        // Calculate distances to the training images, stored as negated
        // squared distances: the negation makes top_index(), which finds
        // the largest elements, return the nearest neighbours, and the
        // square root is skipped since it doesn't change the ordering.
        for (j = 0; j < num_training; j++) {
            distances[j] = 0.0;
            for (ct = 0; ct < num_pixels; ct++) {
                distances[j] -= (training(j, ct) - test(i, ct)) * (training(j, ct) - test(i, ct));
            }
        }

        // Find k nearest:
        nearest = top_index(distances, k);

        // Majority vote among the k nearest neighbours:
        IntegerVector classes(num_classes);
        for (j = 0; j < k; j++) {
            classes[training_outputs[nearest[j]] - min_class] += 1;
        }
        IntegerVector this_classif_i = top_index(classes, 1);
        classifs[i] = this_classif_i[0] + min_class;
    }

    return classifs;
}
R (the PCA is done manually because prcomp() ate up too much memory when I used it, possibly incorrectly, when first trying to
get this to work; taking eigen() of the 784 × 784 covariance matrix keeps memory use modest however many training images there are):
library(Rcpp)
library(ggplot2)
library(EBImage)
load("../digit_data/r_images_nielsen.rda")
sourceCpp("knn_by_me.cpp")

num_training = length(digits$training)
num_test = length(digits$test)
training_image_matrix = matrix(0, nrow=num_training, ncol=784)
test_image_matrix = matrix(0, nrow=num_test, ncol=784)

# Training
for (j in 1:num_training) {
    training_image_matrix[j, ] = images$training[[j]]
}

# Doing the PCA manually because prcomp soaked up too much memory (?? -- maybe I did something wrong with it).
digit_means = colMeans(training_image_matrix)
for (j in 1:num_training) {
    training_image_matrix[j, ] = training_image_matrix[j, ] - digit_means
}
for (j in 1:num_test) {
    test_image_matrix[j, ] = images$test[[j]] - digit_means
}

var_covar = cov(training_image_matrix)

# Remove zero-variance pixels (not needed unless the variances are normalised)
nonzero_i = which(diag(var_covar) > 0)
var_covar = var_covar[nonzero_i, nonzero_i]
training_image_matrix = training_image_matrix[ , nonzero_i]
test_image_matrix = test_image_matrix[ , nonzero_i]

# Usually you'd want to normalise the variances, but I get very marginally
# worse results when I do this:
#
# pixel_stdevs = sqrt(diag(var_covar))
# pixel_stdevs = pixel_stdevs[nonzero_i]
#
# for (i in 1:num_training) {
#     training_image_matrix[i, ] = training_image_matrix[i, ] / pixel_stdevs
# }
#
# for (i in 1:num_test) {
#     test_image_matrix[i, ] = test_image_matrix[i, ] / pixel_stdevs
# }

eigenthings = eigen(var_covar)
new_dim = 30
reduced_training_matrix = training_image_matrix %*% eigenthings$vectors[ , 1:new_dim]
reduced_test_matrix = test_image_matrix %*% eigenthings$vectors[ , 1:new_dim]

test_answers_vec = k_nearest_classif(reduced_training_matrix, digits$training, reduced_test_matrix, k=3)

true_answers = digits$test
qplot(true_answers, test_answers_vec, alpha=.1, position=position_jitter()) + guides(alpha=FALSE)
print(sprintf("score = %d", length(which(test_answers_vec == digits$test))))
Conclusions
MNIST digits are pretty easy to classify, and neural networks aren't a lot better at the task than much less sophisticated algorithms. As a dataset
to learn on, MNIST has the very nice property that it's small and doesn't take too long to process, but things will certainly be more interesting when I
branch out into new territory instead of trying to marginally improve the second or third significant figure in the number of MNIST digits correctly
classified.
Posted 2015-11-17.