Playing around with neural networks and R

I was recently confronted through my work to a classification problem : given a set of explanatory variables, which category a player will most likely end up in (I work in the videogame industry).

To be frank, my statistical knowledge was a little rusty since I have been doing web-dev for a year (unfortunatly stats are not like riding a bike : you do forget after a while). So I ended up doing a quick litterature review in order to list the tools that could help me with this task.

I began with a logistic regression but wasn’t that happy with the accuracy of the result and the implementation was not that easy due to the high volume of data I was dealing with.

Through my readings, I came in contact with various techniques of machine learning and was eager to try them out. I’ve heard about it in the paste but it seemed like a mystical and out of reach corner of computer and information sciences.

And I was wrong. It’s accessible, it works and it’s a lot of fun (well, data-scientist-kinda-fun). What follows is my naive attempt at solving a problem that puzzled me for a while somehow : automatic shape recognition. I mean, how long did it took us as kids to be able to put those damn educational toys into the rightly shaped slot ? Well, quite a long time after all…

Before going any further, let’s review quickly the basics of a neural network. If you’re already familiar with the concept, you might as well forget about the next section and jump to the code below.

What is an Artificial Neural Network (ANN) ?

First, let’s go one level up and ask ourselves : what is machine learning ? I won’t go into the details because I intend to write a post about it. So here is a quick definition :

Machine learning is about making computers modify or adapt their actions so that these actions get more accurate over time. So the idea behind learning is that percepts should be used not only for acting, but also for improving the system’s ability to act in the future.

Likewise to animal learning, the process goes through 3 steps :

remembering : recognizing the last time we were in a particular situation,
adapting : trying to reproduce the process that led to the correct action ; if reproducing doesn’t lead to the right result, this is where the correction will take place.
generalising : recognizing similarities between different situations in order to broaden the range of application of our conclusions.

In animals, learning occurs within the brain. It deals with high-dimensional, noisy and often inconsistent data from which it quickly produces optimal answers regarding the goal(s) of the system.
So scientists from several disciplines (computer sciences, biology, mathematics…) tried to solve the machine learning problem by bio-mimickry.
Their idea was to model and simulate neural biological nervous systems by combining many simple computing elements (neurons) into a highly interconnected system in the hope that these systems could give computers at least some of the cognitive power which can be found in the living realm.

How does a neuron work ?

A neuron (or nerve cell) is the fundamental functional unit of all nervous system tissue. Each neuron consist of a cell body holding the cell’s nucleus. Branching out from the cell body are a number of fibres called dendrites and a single long fibre called the axon. Dendrites branch into a bushy network around the cell, whereas the axon stretches out for a long distance. Eventually, the axon also branches into strands that connect to the dendrites and cell bodies of other neuron. The connecting junction is called a synapse.

biological_neuron

Signals are propagated from neuron to neuron by a complicated electrochemical reaction. Chemical transmitter substances are released from the synapses and enter the dendrite, raising and lowering the electrical
potential of the cell body. When the potential reaches a threshold, the neuron spikes or fires and an electrical pulse (action potential) of fixed strength and duration is sent down the axon. After firing, the neuron must wait for some time to recover its energy (refractory period) before it can fire again.
Synapses that increase the potential are called excitatory, and those that decrease it are called inhibitory. They exhibit plasticity : the strength of their connection can vary over time in response to the pattern of stimulation they experienced, and new connections can be made.
Each neuron can be viewed as a separate processor, performing a very simple computation : deciding weather or not to fire. This makes the brain a massively parallel computer.

The resulting model :

Remember that before going on a never-ending description of the neural system, I was saying that we want to produce a simplified model of the nervous system in order to “steal” some if its magic (its cognitive power). Hopefully some much brighter mind than mine tackled the problem and came up with a solution some 50 years ago or so.

Behold ! The artificial neural network (ANN). It is composed of a number of nodes or units (neurons) connected by links (synapses). Each link has a numeric weight (action potential) associated with it.
Very basically, learning usually takes place by updating the weights.
Some of the units are connected to the external environment and can be designated as input or output units.
The weights are modified so as to try to bring the network’s input/output behaviour more into line with that of the environment.
Each unit has a set of input links from other units, a set of output links to other units, a current activation level and a means of computing the activation level at the next step in time, given its inputs and weights.
The idea is that each unit does a local computation based on inputs from its neighbours.

A basic neural network is composed of several layers of neurons : an input layer, one or more hidden layers and an output layer. Each layer of neurons receives its input from the previous layer or from the input layer. Each neuron is connected to all neurons in preceding and following layer. The output of each neuron feeds the next layer or the output layer of the network.

ANN_diagram

Different variations of the model exist. For now on we will focus on the feed-forward architecture since it’s a very common implementation of ANN.

The learning process explained :

The network learns by :

reading a training vector from the training set (explanatory variables).
sending the vector of inputs through the network (forward-propagation),
comparing the output of the network with the true output (dependant variable),
computing an error term,
adjusting all weights by rationing the error throughout the network (back-propagation).

This process repeats itself with the next training pair and continues until all errors are small enough. Once the error has reached a minimum, the network is considered trained and the existing weights associated with the nodes represent the network’s knowledge.

Since my goal is not to bore you to death (at least not that fast), I won’t dive in deeper in the mathematical framework underlying these operations. This will lead yet again to another post.

Why would I need such a thing ?

I will speak for my trade here, but to me apart from the “cool” factor associated with ANN, it is a very interesting tool that I now like to use for my analysis.
ANN can be applied to approximate any complex functional relationship. Unlike generalized linear models, it is not necessary to specify at the beginning the type if relationship between inputs and outputs. Recently I had to deal with survival-type data, and it was quite a pain in the **** to find and develop a custom weibull distribution function that would fit my reel distribution.
With an ANN, no problem of the sort, which makes it a valuable assets for a wide range of fields. Scientists and engineers use them to model and predict complex phenomena. Investment bankers are creating investment models to better manage money and improve profits etc.
The main drawback is that it is not a formal modelling process. The weights cannot be explicitly interpreted as opposed to the coefficients of a regression. So this technique is more a of a useful complement to traditional modelling methods rather than a substitute.

Pfiou, that’s enough chit-chat for now don’t you think ? Let’s see what we can do with this thing !

An attempt to classify shapes using an ANN :

What do we want to do ?

The idea here is very simple : we would like to train our network so that it will be able to recognize 3 basic shapes – rectangle, triangle and circle. For this purpose I created a “dataset” of 15 (for training) + 8 (for testing) png files per shape. Each image is a 32*32 pixels png representing a distinct shape (I’m quite ashamed to say that it was generated manually on paint…but I told you, it’s quite a silly example).

So how are we going to do it ? We need to feed the images to the network as inputs, and give it the correct output so that it can learn to reproduce the relation between the two.

First thing first. We will obviously need to perform some kind of transformation on the images so that we can make it a proper input for the network. So what is a proper input ? It has to be an array of values, ranging between 0 and 1 (sometimes -1 and 1). Each input neuron will thus accept a single value from the array. Now, what is the numerical representation of our images ? At first glance it looks like a 32 rows * 32 columns matrix with a cell value = 1 if the pixel is white and 0 if it is black. That’s a good start. But unfortunately we need to feed each input neuron with a single value, not a whole matrix or a vector. Which means that we will need two things : 32*32 = 1024 input neurons and a way to transform our matrix in a one dimension vector holding our pixel values.

Let’s take a look at the output. This is a categorisation problem : we must tell our network that the result of the input is either a rectangle, a triangle or a circle. The value of the output is not important as long as it discriminate between the 3 classes. The traditional approach in such a case is to use a 1-of-N encoding : in a categorisation problem with N classes, N output neurons will be created. For an input vector, the associated output vector will hold the value 1 for the output corresponding to the input’s class, 0 for the others. Back to our situation : we have 3 classes, thus we will have 3 output neurons. Let’s imagine an input corresponding to a rectangle, the associated output vector would be [1, 0, 0].

That’s that for the inputs and the outputs. But what happens in between ? We need to choose a number of hidden layers, and a number of nodes (hidden neurons) for each hidden layer. How are we going to do that ? Well the bad news is that no rule of thumb exists as to how to choose the optimal number of intermediary elements. It’s going to be a game of trial and errors in order to find the configuration that best suit your problem. When the number of hidden nodes and layer of an ANN change, the implied functional relationship modelled changes as well. For instance, a neural network built without a hidden layer is directly comparable to a linear regression. I hear you already : “with an infinite potential number of hidden nodes and layers, how am I going to systematically chose the right set-up ?”. Well first ease of implementation and computation will quickly limit the number of nodes and layers since the more complexity in the network, the more processing power is needed to make it run. Furthermore, and this is part of the good news, only 2 hidden layers can represent almost every functional relationship (to understand why, I recommend eating two volumes of Calculus each morning during breakfast). Based on this assumption (and this is the good news), I wrote a little piece of code that computes every configuration you would like to try so that you can pick the most efficient.

The data

How much data will be needed to train out network ? One of the books I went through gives the value of 10 times the number of weights in order to get an accurately trained network. The number of weights can be found with the following formula (only if the number of hidden neurons is the same in each hidden layers) :

w = (m+1)*n + (n+1)*n*(h-1)+(n+1)*p

with :
    – m = number of input neurons
    – n = number of hidden neurons
    – h = number of hidden layers
    – p = number of output neurons

Since I didn’t want to spend my day messing around with paint, I only generated 15 training + 8 testing images per shape (so 69 images total).

You can download it here.

The code

All the code is written in R. I will probably write the Python equivalent soon (yeah right, as if soon meant something).

So what should we begin with ? Remember that before jumping into the ANN, we need to import and reformat our images. We use the package EBImage in order to import the png files as a matrix :

EBImage is an R package which provides general purpose functionality for the reading, writing, processing and analysis of images. Furthermore, in the context of microscopy based cellular assays, EBImage offers tools to transform the images, segment cells and extract quantitative cellular descriptors.

EBImage will allow us to import our png as matrix of values. It should recognize that our files are in greyscale, but in that case it doesn’t (I feel like it’s because of the paint-origin of the files >< ) so the image-object created by the package holds 3 frames (for the RGB values of the pixels) instead of 1. It doesn’t matter, any of these 3 will do since we are only interested in 2 color dimensions (black and white).

#To install the package : 
source("http://bioconductor.org/biocLite.R")
biocLite("EBImage")
library(EBImage)

Now that we have our 32 pixels * 32 pixels matrix, we then need to reformat it as a vector. This is done by the vectorize function that just binds row one after another in a 1 dimension array.

The function image.process follows all these steps, taking a path to the image file as input and returning its “vectorized” version.

image.vectorize <- function(x){
  dim <- ncol(x)*nrow(x)
  out <- matrix(ncol = dim)
  
  for (i in 1:nrow(x)){
    from = (i-1)*ncol(x)+1
    to = from + ncol(x)-1
    out[1, from:to] <- x[i,]
  }
  return(out)
}

image.process <- function(filepath) {
  image <- readImage(filepath)
  single.frame <- getFrame(image, 1)
  values.matrix <- imageData(single_frame)
  values.vector <- image.vectorize(values_matrix)
  
  return(values.vector)
}

Let's load our images training set and generate the corresponding output:

path <- "C:/.../Neural Networks/Visual classification/shapes/training/"
in.train <- matrix(nrow=45, ncol=1024)

##Loading rectangles
for (i in 1:15) {
  file = paste(path, "rectangle", as.character(i), ".png", sep="")
  in.train[i,] <- image.process(file)
}

##Loading trangles
for (i in 1:15) {
  file = paste(path, "triangle", as.character(i), ".png", sep="")
  in.train[i+15,] <- image.process(file)
}

##Loading circles
for (i in 1:15) {
  file = paste(path, "circle", as.character(i), ".png", sep="")
  in.train[i+30,] <- image.process(file)
}

##1 for rectangle, 2 for triangle and 3 for circle
out1 <- matrix(c(rep(1,15), rep(0,30)))
out2 <- matrix(c(rep(0,15), rep(1,15), rep(0,15)))
out3 <- matrix(c(rep(0,30), rep(1,15)))

df.train <- data.frame(input=in.train, out1=out1, out2=out2, out3=out3)

For the network itself : I used the neuralnet package because of its ease of implementation and flexibility. Like in a regression, you need to provide a formula defining the link between the covariates and the response variable(s). In our case the responses are out1, 2 and 3 and the covariates are every individual pixel in the images. So our formula will look something like out1+out2+out3 ~ input1 + ... + input1024 :

formula <- "out1+out2+out3~input.1"
for (i in 2:ncol(inputs)){
  input <- paste("input.", i, sep="")
  formula <- paste(formula, input, sep="+")
}

Next, train the NN :

library(neuralnet)
nn <- neuralnet(formula, df.train, hidden=c(10,10), linear.output=FALSE, threshold=0.0001, rep=20)
nn

Test it on another set of images. The compute function is part of the neuralnet package. It computes the values predicted by the trained model provided. The inputs must be in the same format as the one provided to train the network.

path <- "C:/.../Neural Networks/Visual classification/shapes/training/"
in.test <- matrix(nrow=24,  ncol=1024)
##Loading rectangles
for (i in 1:8) {
  file = paste(path, "rectangle", as.character(i), ".png", sep="")
  in.test[i,] <- image.process(file)
}

##Loading trangles
for (i in 1:8) {
  file = paste(path, "triangle", as.character(i), ".png", sep="")
  in.test[i+8,] <- image.process(file)
}

##Loading circles
for (i in 1:8) {
  file = paste(path, "circle", as.character(i), ".png", sep="")
  in.test[i+16,] <- image.process(file)
}

df.test <- data.frame(input=in.test)

out1 <- c(rep(1,8), rep(0,16))
out2 <- c(rep(0,8), rep(1,8), rep(0,8))
out3 <- c(rep(0,16), rep(1,8))

out.ideal <- matrix(c(out1, out2, out3), ncol=3)

for (i in 1:24){
  prediction <- compute(nn, df.test[i,], rep=1)
  print(round(prediction$net.result, 2))
}

That's one way to do it. But recall that I told that I crafted a little function (actually two) in order to process a lot of networks with different configurations in order to compare their predictive accuracy (in that case, the % of correct answers). nn.accuracy computes a single trained network and compares its output to an ideal output. It then returns the % of correct answers. The function nn.benchmark will train multiple networks with a different number of nodes in the 2 hidden layers each time. Each configuration is run 10 times in order to get the average predictive power (for the answer to the question : "why would the accuracy variate if the configuration and the data stay the same ?", you will have to wait for my post about the math behind ANN). The results are then stored in a dataframe in order to compare every configuration computed.

The function is supposed to be fail-proof, which means that if a configuration doesn't reach an equilibrium (in other words if the network doesn't converge toward a solution), the function wont crash and the corresponding row in the dataframe will be filled with "NAs".

nn.accuracy <- function(nn, in.test, out.ideal, rep) {
  ncols = ncol(out.ideal)
  nrows = nrow(out.ideal)
  out.actual <- matrix(ncol = ncols, nrow=nrows)
  correct.answers <- 0
  
  for (i in 1:nrow(in.test)){
    out.actual[i,] <- round(compute(nn, in.test[i,], rep=rep)$net.result, 3)
    for (j in 1:ncols){
      out.actual[i,j]<- ifelse(j==which.max(out.actual[i,]), 1, 0)
    }
    if (which.max(out.actual[i,]) == which.max(out.ideal[i,])){
      correct.answers <- correct.answers + 1
    }
  }
  
  correct.percentage <- round((correct.answers/nrows)*100, 2)
  return(correct.percentage)
}


nn.benchmark<- function(formula, training.data, testing.data, out.ideal, min.nodes.1layer, max.nodes.1layer, min.nodes.2layer, max.nodes.2layer, threshold, reps) {
  nrows <- ((max.nodes.1layer-min.nodes.1layer)+1)*((max.nodes.2layer-min.nodes.2layer)+1)
  
  benchmark.result <- matrix(ncol=6, nrow=nrows)
  result.row <- 1
  
  nn.results <- matrix(nrow=1, ncol=reps)
  
  for (i in min.nodes.1layer:max.nodes.1layer) {
    for (j in min.nodes.2layer:max.nodes.2layer) {
      print(paste("running the network with", i, "node(s) on the 1st hidden layer and", j, "node(s) on the 2nd"))
      try({
        nn.tested <- neuralnet(formula, training.data, hidden=c(i,j), linear.output=FALSE, threshold=threshold, rep=reps, stepmax = 10000)
        for (k in 1:reps){
          nn.results[1,k] <- nn.accuracy(nn.tested, testing.data, out.ideal, k)
        }
      
        benchmark.result[result.row,]<- c(i, j, mean(nn.results[1,]), max(nn.results[1,]), min(nn.results[1,]), sd(nn.results[1,]))
      })
      result.row <- result.row+1
    }
  }
  
  benchmark.result <- as.data.frame(benchmark.result)
  colnames(benchmark.result) <- c("layer 1", "layer 2", "average", "max", "min", "standard deviation")
  
  return(benchmark.result)  
}

##have fun with it
nn.benchmark(formula, df.train, df.test, out.ideal, 8, 12, 8, 12, 0.01, 10)

I ran the benchmark function between 1 and 15 nodes respectively in each of the hidden layers. In my case, the best candidate was an architecture with 12 nodes in the first layer and 14 in the second. It resulted in an average accuracy of 66.25%, with the best trial being right at 83.33%. The model with the highest accuracy overall was right 87.5% of the time and consisted of 14 and 13 hidden nodes.

Don't hesitate to play around with the threshold too. Depending on you problem, it might make convergence more difficult, but it could also sharpen your prediction.

When you have spotted the best candidate, train it again over multiple trials, pick the best one, et voila.

Sorry for the ugly nested loops. I spent 2 month on Python and I got used to it but I know that it's a very bad habit to have on R.

That's it for now. I hope that you were able to find some useful information. Don't hesitate to comment if you would like some clarification or something.

References :

https://www.biostat.wisc.edu/~kbroman/teaching/statgen/2004/refs/ct.pdf
http://www.sascommunity.org/seugi/SEUGI1994/Neural%20Networks%20and%20Statistical%20Models.pdf
http://www.loria.fr/~rougier/teaching/#neural-networks
http://www.heatonresearch.com/content/non-mathematical-introduction-using-neural-networks
http://www.bioconductor.org/packages/release/bioc/html/EBImage.html
Machine Learning, an Algorithmic Perspective, Stephen Marsland (2009)
Artificial Intelligence, a Modern approach, Stuart Russel and Peter Norvig (1995)