Training Neural Networks with Genetic Algorithms

In this blog post I present my findings from an independent analytical and computational study of using genetic algorithms to train neural networks. This was my final project for an Introduction to Cognitive Science course that I took at The University of Texas at Austin, under Dr. David Beaver.
My motivation comes from the fact that animal brains, and particularly the human brain (the topic of the entire course), are themselves genetically evolved neural networks. Artificial neural networks trained by genetic algorithms are therefore a good rudimentary starting model for understanding the hardware of the brain. This sentiment is echoed in my primary reference, Evolutionary Algorithms for Neural Network Design and Training, Branke et al. (1995). However, the paper discusses the idea mostly qualitatively, and since the algorithm is relatively simple to implement, I decided I would deepen my understanding far more by implementing it myself on some simple networks.

Background

McCulloch-Pitts Neurons

McCulloch-Pitts Neuron

A McCulloch-Pitts neuron is a simple decision-making unit that forms the building block of the kind of artificial neural networks I have constructed and studied here. It has a set of inputs (each of which is a 0/1 bit) connected to it via weighted links. It checks whether the weighted sum of the inputs exceeds a specified threshold t, and outputs 1 if it does and 0 otherwise.
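To make this concrete, here is a minimal sketch of such a neuron in R (the language used throughout this post); the function name mp_neuron and its signature are my own:

mp_neuron <- function(inputs, weights, t) {
  #inputs: vector of 0/1 bits; weights: numeric vector; t: threshold
  as.numeric(sum(inputs * weights) > t)
}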

Artificial neural networks

Neurons of the above kind can be connected, and their weights and thresholds set, in ways that make their action emulate desired computational operations. Such a system is called an artificial neural network. The following are two simple networks that act as 2-input OR and AND gates:
OR Gate AND Gate
Each network has both weights equal to 1. The OR gate has a threshold of 0.5, whereas the AND gate has a threshold of 1.5.
It should already be clear that there is no unique set of weights and thresholds that makes a neural network perform a given task. We shall delve into this further later.
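Using the mp_neuron sketch from above, the two gates can be checked directly:

mp_neuron(c(0,1), c(1,1), 0.5)   #OR gate: returns 1
mp_neuron(c(0,1), c(1,1), 1.5)   #AND gate: returns 0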

Training artificial neural networks

What if we wish our network to perform a certain kind of operation but do not know what weights and thresholds will do the work? First, we need to define in a simple way what is meant by a certain operation: it means mapping each possible input tuple to its right answer. This mapping is called a truth table. The following, for example, is the truth table of a two-input XOR gate:
\boldsymbol{i_{1}}	\boldsymbol{i_{2}}	\boldsymbol{o}
0	0	0
0	1	1
1	0	1
1	1	0
A standard way of converging on the right set of weights for a neural network prescribed to perform a certain task is the use of learning rules. These rules iteratively update the weights based on the difference between the correct answer for a set of inputs and the answer the network actually gives.
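As an aside, here is a minimal sketch of such a rule in R (the classic perceptron rule for a single neuron with a fixed threshold; the function name, learning rate and epoch count are my own choices):

perceptron_train <- function(truth, t = 0.5, eta = 0.1, epochs = 100) {
  #truth: matrix with columns i1, i2 and the correct output o
  w <- rnorm(2)   #random initial weights
  for (e in 1:epochs) {
    for (r in 1:nrow(truth)) {
      out <- as.numeric(truth[r,1]*w[1] + truth[r,2]*w[2] > t)
      err <- truth[r,3] - out            #correct answer minus actual answer
      w <- w + eta * err * truth[r,1:2]  #update weights in proportion to the error
    }
  }
  w
}
#e.g. perceptron_train(matrix(c(0,0,0, 0,1,1, 1,0,1, 1,1,1), ncol=3, byrow=TRUE))
#converges for the OR truth table, but never for XOR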

Limitations of learning-rule based weight training

However, this simple method often fails to converge, because the number of nodes in the network and the connections between them may be wrong to start with, such that no set of weights would provide the desired operation. We can see this by taking the simple connectivity of the OR and AND gates illustrated above and trying to emulate an XOR gate: the process never converges.
Therefore, in the more general case the problem might be extended to include the structure of the network as well, and the question becomes one of finding:
  1. The number of nodes in the network.
  2. Their connectivity, i.e. the adjacency matrix (whose (i, j)-th element is 0 or 1 depending on whether the i-th and j-th nodes are connected).
  3. The weights of the connections. (2 and 3 may be combined into finding a weighted adjacency matrix, where a zero matrix element means that two nodes are disconnected, and a non-zero element specifies the weight of the connection; a small sketch follows below.)
  4. The firing thresholds of the neurons.
Some of these variables are discrete and others continuous, and they are interrelated in complex ways. All of this means that learning rules cannot simply be generalized to find all of these variables with an updating scheme. Herein lies the utility of genetic algorithms.
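For instance, such a weighted adjacency matrix might be sketched in R as follows (an illustration with made-up numbers, not part of my implementation):

W <- matrix(0, nrow = 3, ncol = 3)   #3 nodes, all disconnected to start with
W[1,3] <- 0.8    #node 1 feeds node 3 with weight 0.8
W[2,3] <- -1.2   #node 2 feeds node 3 with weight -1.2; zero entries mean no connection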

Genetic Algorithms

Genetic algorithms are commonly used where the problem can be formulated as a search in a multi-dimensional parameter space, and where a fitness function can be attached to any choice of parameters (a measure that mirrors how close the solution represented by those parameters is to the desired solution). The following schematic illustrates a typical genetic algorithm.
Genetic Algorithm Flowchart
In the context of constructing a neural network, genetic algorithms provide a natural method of solution in the general case, where all of the variables mentioned above are floating and can be concatenated into a single string of parameters.
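In the implementation below, for instance, one individual of the population is exactly such a string: a vector w of nine numbers. As a sketch (the decode function is my own annotation, not part of the code that follows):

decode <- function(w) list(
  input_weights  = w[1:4],   #weights from the two inputs to the two hidden neurons
  hidden_weights = w[5:6],   #weights from the hidden neurons to the output neuron
  thresholds     = w[7:9]    #the three firing thresholds
)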

Implementation

There are some important points to note about my particular implementation of the method:
  1. For the sake of simplicity, and owing to limited time, the only floating variables were the weights and thresholds; the number and connectivity of the neurons were fixed. It is not hard to imagine extending this implementation to float those parameters as well.
  2. The fitness function is chosen to be the sum of the absolute differences between the correct answers and the answers given by the network. For example, consider the situation where we are trying to model an OR gate and our network gives us the following answers:
\boldsymbol{i_{1}}	\boldsymbol{i_{2}}	Correct answer	Network’s answer
0	0	0	1
0	1	1	0
1	0	1	0
1	1	1	1

Then the fitness of the network is going to be |0-1|+|1-0|+|1-0|+|1-1|=3. By this definition, the lower the fitness, the better the network, and a fitness of zero means that the network achieves the desired behaviour for all inputs. As long as the fitness measure ranks the individuals accurately by their performance, the exact form of the fitness is irrelevant to the working of the algorithm.
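In R, the fitness for the table above is just this sum:

sum(abs(c(0,1,1,1) - c(1,0,0,1)))   #correct answers vs the network’s answers: 3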

  3. The reproduction is asexual, i.e. there is no mating or crossover. In each iteration, the fittest individual is selected and a new generation is created by mutating each of its parameters.
  4. To speed up convergence and make it monotonic, if in an iteration all of the offspring are less fit than the parent, the parent is selected again. This ensures there is no retrograde evolution (a strategy known as elitism).

Code

The following is the program in R:
#the XOR network: two hidden neurons (m1, m2) and one output neuron
xornet<-function(i1,i2,w){
  m1=i1*w[1]+i2*w[3]>w[7]   #hidden neuron 1
  m2=i1*w[2]+i2*w[4]>w[8]   #hidden neuron 2
  return(m1*w[5]+m2*w[6]>w[9])   #output neuron
}

#calculates the fitness of a network defined by a set of parameters
fitness<-function(w){
  s=0
  for (i1 in c(0,1)){
    for (i2 in c(0,1)){
      s=s+abs(xornet(i1,i2,w)-xor(i1,i2))
    }
  }
  return(s)
}

#initialize random population of weight vectors
pop<-NULL
for (i in 1:10){
  pop<-rbind(pop, rnorm(9))
}
ftraj<-NULL    #trajectory of the best fitness across generations
wspace<-NULL   #parameter vectors that achieve fitness 0
for (gen in 1:1000){
  #sort according to fitness
  fitnesslist<-NULL
  for (i in 1:dim(pop)[1]){
    fitnesslist<-c(fitnesslist, fitness(pop[i,]))
  }  
  pop<-cbind(pop,fitnesslist)
  pop<-pop[order(pop[,10]),]
  #list the parameter vectors that work. Can be plotted to visualize the solution region in parameter space
  for (i in 1:10){
    if ((pop[,10]==0)[i]){
      wspace<-rbind(wspace,c(pop[i,1],pop[i,2]))
    }
  }
  #record the best (lowest) fitness of this generation
  ftraj<-rbind(ftraj,fitness(pop[1,]))
  
  #generate new generation by mutations from fittest
  fittest=pop[1,]
  pop<-NULL
  for (i in 1:9){
    pop<-cbind(pop, rnorm(10, mean = fittest[i], sd = 1))
  }
  pop<-rbind(fittest[1:9],pop)   #keep the parent too (elitism)
}
plot(ftraj, pch=19, xlab="Generation", ylab="Fitness of the fittest")
lines(ftraj)

This code implements and trains the neural network for an XOR gate; the xornet and fitness functions are the parts that have to be changed accordingly for other networks.
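For illustration, the network function for the simple 2-input OR network might reduce to a single neuron like this (a sketch; the name ornet matches the function referred to in the next section, with the threshold fixed at 1 as explained there):

ornet<-function(i1,i2,w){
  return(i1*w[1]+i2*w[2]>1)   #a single neuron with the threshold fixed at 1
}
#...with the fitness function comparing against (i1 | i2) instead of xor(i1,i2)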

2-input OR Gate

Analytical observations

OR Gate

In order for the above network to reproduce the truth table of an OR gate, we must have the following:
  1. For (0,0): 0 \cdot w_{1} + 0 \cdot w_{2} \leq t, i.e. t \geq 0.
  2. For (1,0): 1 \cdot w_{1} + 0 \cdot w_{2} > t, i.e. w_{1} > t.
  3. Similarly, for (0,1) we have w_{2} > t.
  4. For (1,1): 1 \cdot w_{1} + 1 \cdot w_{2} > t, i.e. w_{1} + w_{2} > t, which is automatically satisfied as long as 2 and 3 are satisfied.
Thus, in the parameter space there is a whole region of solutions, not a unique set of weights and threshold. These solutions occupy the unbounded region defined by criteria 1, 2 and 3. (In the input space, the corresponding surface separating the inputs that produce output 1 from those that produce 0 is called the decision boundary of the perceptron.)

In my simplified implementation I further fixed the threshold at 1 (as in the ornet function sketched above), so that the search space of parameters reduces to the 2D w_{1},w_{2} space and the solution region is defined by the area w_{1},w_{2}>1, illustrated below:

OR Gate parameter space

Results

One may consider the evolution of the parameter vector to be a Markov chain. If we plot the fittest of each generation, the Markov chain looks like this for an example run:
OR Gate Markov Chain
Notice how we start outside the solution region and wander into it.
If we plot the fitness of the fittest individual of each generation for the Markov chain, it converges to 0. Here is a plot for an example run:
OR Gate fitness

This is a scatter plot of w_{1},w_{2} values after convergence for a sample run:

OR Gate solutions

This is in accordance with the solution region schematically depicted previously.

2-input AND gate

Analytical observations

Following similar arguments as for the OR gate, we can conclude the following about the weights and thresholds if the network is supposed to work as an AND gate instead:
  1. For (0,0): 0 \cdot w_{1} + 0 \cdot w_{2} \leq t, i.e. t \geq 0.
  2. For (1,0): 1 \cdot w_{1} + 0 \cdot w_{2} \leq t, i.e. w_{1} \leq t.
  3. Similarly, for (0,1) we have w_{2} \leq t.
  4. For (1,1): 1 \cdot w_{1} + 1 \cdot w_{2} > t, i.e. w_{1} + w_{2} > t, which is no longer automatically satisfied.
For a fixed threshold (say 1), the solutions in the parameter space now occupy a bounded triangular region:
AND gate parameter space
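As a quick sanity check of this region, one can test a point inside the triangle (a sketch, again with the threshold fixed at 1):

w <- c(0.7, 0.7)   #satisfies w1 <= 1, w2 <= 1 and w1 + w2 > 1
for (i1 in c(0,1)) for (i2 in c(0,1))
  print(as.numeric(i1*w[1] + i2*w[2] > 1))   #prints 0 0 0 1: the AND truth table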

Results

All the results are analogous to the case of the OR gate. The only plot I will show here is the space of solutions for an example run, which is in accordance with the solution region found analytically above:
AND Gate solutions

2-input XOR gate

Analytical observations

In an assignment we saw how a network connectivity like the one I have used so far fails to converge to the operation of an XOR gate for any choice of weights. This is unsurprising, because the response of an XOR gate is not a linear function of its inputs: XOR is not linearly separable. The network therefore needs to be more complex. In particular, here is a network, constructed by observation, that functions as an XOR gate:
XOR Gate
The intermediate OR gate ensures that at least one of the inputs is on, while the NOT-AND (NAND) gate ensures that both are not on at the same time. The outputs from these pass through a final AND gate that makes sure both conditions are met.
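Using the mp_neuron sketch from earlier, this hand-constructed network can be written out directly (the particular weights and thresholds are my own illustrative choices, not read off the schematic):

xor_by_hand <- function(i1, i2) {
  or_out   <- mp_neuron(c(i1, i2), c(1, 1), 0.5)     #at least one input on
  nand_out <- mp_neuron(c(i1, i2), c(-1, -1), -1.5)  #not both inputs on
  mp_neuron(c(or_out, nand_out), c(1, 1), 1.5)       #AND of the two conditions
}
#xor_by_hand reproduces the XOR truth table: 0, 1, 1, 0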

Results

As we can see, this network introduces variability in the thresholds as well as the weights. This is an opportunity to use the power of the genetic algorithm, which is easily extended to include the thresholds as floating parameters as well; that is what I have done in the code. The number of nodes and the connectivity are still fixed.
The parameter space is now 9-dimensional, and the solution region occupies a much smaller fraction of it, so convergence takes longer. Here is a plot of the convergence of fitness for a sample run:
XOR Gate fitness
The Markov chain and the solution parameter vectors cannot themselves be plotted, because the space is 9-dimensional.
The weight and threshold values that resulted from this run have been plugged into the schematic below. As we can see, the logic of this network is quite different from that of the one constructed earlier by observation, yet it manages to produce the same truth table as an XOR gate. This is because the structure, nature, connectivity and operation of the hidden units of two networks may be different, yet their overall behaviour may be the same.
XOR Gate
The emergence of such alternative solutions allows one to see the problem anew in retrospect. This is indeed the beauty of genetic algorithms: an unguided method for obtaining seemingly designed solutions.

Conclusions

As we saw, we can use genetic algorithms to train artificial neural networks to perform desired computations. The advantages of the algorithm over standard learning-rule based updating are several.
  • It is more general, i.e. it can be used to find the number of nodes, the connectivity and the firing thresholds as well, and is therefore more easily scalable to large and complex networks.
  • It is an entirely unguided method and requires little design and foresight.
  • As a side effect, it also provides a way to visualize the solution region in the search space, by plotting the solution vectors reached from different starting conditions.
A disadvantage is that, since it is unguided by conscious design, it is often much slower to converge than learning-rule based updating schemes.

R code for multivariate random-walk Metropolis sampling

I couldn’t find a simple R code for random-walk Metropolis sampling (the symmetric-proposal version of Metropolis-Hastings sampling) from a multivariate target distribution in arbitrary dimensions, so I wrote one. This is also my first R code.
It requires the package MASS to sample from the multivariate normal proposal distribution using the mvrnorm function. If you are using R on Windows, you can download the package zip for Windows from the link, and use Packages > Install package(s) from local zip files… from the GUI to install the package.
The reason I couldn’t write the code for a general Metropolis algorithm (i.e. for any arbitrary symmetric proposal distribution, not just the normal) or the more general Metropolis-Hastings algorithm (with any arbitrary proposal distribution, symmetric or not) is that generating the proposal point would then require sampling from an arbitrary proposal distribution. This is easy only for a few standard distributions, but hard in general (which is the point of using such algorithms in the first place).

I. Function

The following is the function that does the random-walk Metropolis-Hastings sampling when supplied with the required arguments. Notes about the arguments follow the code.

rwmetro <- function(target,N,x,VCOV,burnin=0)
{
    require(MASS)   #requires package MASS for multivariate normal sampling
    samples <- x
    for (i in 2:(burnin+N))
    {
        prop <- mvrnorm(n = 1, x, VCOV)   #propose a step from a normal centered at x
        if (runif(1) < min(1, target(prop)/target(x)))   #Metropolis acceptance rule
            x <- prop
        samples <- rbind(samples,x)
    }
    samples[(burnin+1):(N+burnin),]   #discard the burn-in rows
}

II. Arguments

  1. target function
    The function defining the multivariate target distribution, written as a function of an n-vector, where n is the number of variables on which the distribution is defined. The different variables of your distribution must be written as x[1], x[2] etc.
    Example
    An example is the following, defining the function f(x,y)=exp(-5*abs(x^2+y^2-1)) in two dimensions:

    ring2D <- function(x)	# x is a vector
    {
    	exp(-5*abs(x[1]^2+x[2]^2-1))
    }
    
  2. N integer
    The final sample size (i.e., excluding the burn-in length).
    Example
    100
  3. x numeric vector
    The starting point (vector) of your Metropolis-Hastings chain. This vector needs to be the same length as the dimension of the target distribution. Your target function must be able to accept this vector as an argument.
    Example
    c(-0.1,20.5)
  4. VCOV numeric matrix
The variance-covariance matrix of the multivariate normal that is used as the proposal distribution for random-walk Metropolis-Hastings. The dimensions of this matrix must also match the dimension of the target distribution, i.e. the length of vectors acceptable by target and the length of x. You can vary the entries of this matrix and observe your results to see what works better for sampling your target distribution.
    Example
    The following line defines a matrix in two dimensions with .01 variance for each variable and no covariance between them.

    vcov2D <- .01*diag(2)
    
  5. burnin (optional) integer
The ‘burn-in’ length for the chain: the number specified is the number of initial samples to be discarded. If nothing is specified, it is taken to be 0.
    Example
    20

III. Output
numeric matrix
The output is a matrix where each row is a sample from your target distribution, excluding the initial burn-in samples. The number of rows is thus the sample size, and the number of columns is equal to the dimension of the target distribution. You can use this matrix however you want, to save, visualize or analyze your sample.
Example
An example output of sampling a 2D distribution with sample size 5. Each row is an x,y sample.

[1,] 0.12961923 0.03708061
[2,] 0.10765053 -0.02798036
[3,] 0.01112930 -0.07255766
[4,] 0.06049625 -0.04546265
[5,] 0.1832164 -0.1244354

IV. Usage Example
Let’s take the target distribution we used as an example above, f(x,y)=exp(-5*abs(x^2+y^2-1)). This looks like a ring of radius 1 rising from the x-y plane and centered at the origin. Here is a gnuplot surface plot of the distribution (because I found it frustratingly hard to figure out a level plot in R):

Target Distribution in Gnuplot
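(As it turns out, a rough level plot is possible in base R as well; here is a sketch using image and contour:)

g <- seq(-1.5, 1.5, length.out = 200)
z <- outer(g, g, function(x, y) exp(-5*abs(x^2 + y^2 - 1)))
image(g, g, z, xlab = 'x', ylab = 'y')   #heat map of the target density
contour(g, g, z, add = TRUE)             #overlay contour lines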
Let’s generate a sample of size 40,000 from this distribution with the starting point (0,0) and without any burn-in length, and with the variance-covariance matrix we defined before. This is done by calling the function with the correct arguments:

ringsample2D <- rwmetro(ring2D,40000,c(0,0), vcov2D)

This assumes that you have already defined the target ring2D and the matrix vcov2D in the way explained, preceding this function call.
ringsample2D now contains a random sample from the distribution.
With the following one-line R code, I made a scatter plot of the sample from the ringsample2D matrix.

plot(ringsample2D[,1], ringsample2D[,2], xlim=c(-1.5,1.5),ylim=c(-1.5,1.5), main='Metropolis-Hastings Sample',xlab='x', ylab='y', pch='.')

The following is the plot:
Metropolis-Hastings Sample, scatter plot

Putting all the code together in sequence, this is what the full code for defining the arguments, drawing the sample, and making the plot for this example looks like:

# Define arguments
ring2D <- function(x)    # x is a vector
{
    exp(-5*abs(x[1]^2+x[2]^2-1))
}

vcov2D <- .01*diag(2)

#Define the sampling function
rwmetro <- function(target,N,x,VCOV,burnin=0)
{
    require(MASS)   #requires package MASS for multivariate normal sampling
    samples <- x
    for (i in 2:(burnin+N))
    {
        prop <- mvrnorm(n = 1, x, VCOV)   #propose a step from a normal centered at x
        if (runif(1) < min(1, target(prop)/target(x)))   #Metropolis acceptance rule
            x <- prop
        samples <- rbind(samples,x)
    }
    samples[(burnin+1):(N+burnin),]   #discard the burn-in rows
}

# Call the sampling function with the arguments
ringsample2D <- rwmetro(ring2D,40000,c(0,0),vcov2D)

# Use the sample
plot(ringsample2D[,1], ringsample2D[,2], xlim=c(-1.5,1.5),ylim=c(-1.5,1.5), main='Metropolis-Hastings Sample',xlab='x', ylab='y', pch='.')

The following is another example, this time for a 3-variable function, f(x,y,z)=exp(-5*abs(x^2+y^2+z^2-1)). This is just the 3D version of the previous function, and its density is mostly concentrated on a spherical shell of radius 1 and of some thickness, centered at (0,0,0). Notice how we never have to spare a thought for normalizing these functions to draw the sample, which is one of the advantages of the Metropolis-Hastings algorithm.
The target function is now:

ring3D <- function(x)	# x is a vector
{
	exp(-5*abs(x[1]^2+x[2]^2+x[3]^2-1))
}

and the variance-covariance matrix is, say, .01 down the diagonal again:

vcov3D <- .01*diag(3)

Let’s go for a sample of 20000 points after a burn-in of 20000, starting from (0,0,0). The function is then called as:

ringsample3D <- rwmetro(ring3D,20000,c(0,0,0),vcov3D,20000)

ringsample3D is now a 20000×3 matrix, and I use the following to get a rotatable 3D plot of it (requires the rgl package):

require(rgl)
plot3d(ringsample3D[,1], ringsample3D[,2], ringsample3D[,3], xlab='x',ylab='y', zlab='z',col='red', size=1)

Metropolis Sample

I would have preferred a spinning 3D gif, but I’m too tired to work that out right now.

Python Script to Generate Frequency Counts of Words in a Text

The following Python script takes a text file as input and produces an unsorted list of frequency counts of the words in the text as an output text file. It is pretty simple and short, and uses only Python’s regular-expressions module re, which is part of the standard library, so this script will run on any system with a standard Python 2 installation.

from re import compile
# find all lowercased 'words': runs of word characters that may also contain
# commas, periods, apostrophes, or \x92 (the Windows-1252 apostrophe)
l=compile("([\w,.'\x92]*\w)").findall(open(raw_input('Input file: '),'r').read().lower())
f=open(raw_input('Output file: '),'w')
for word in set(l):
	print>>f, word, '\t', l.count(word)	# Python 2 print-to-file syntax
f.close()

Note that ‘words’ here doesn’t mean dictionary words (with such a small script it’s not possible to check against dictionary words). Instead, ‘words’ are what you get when you split the text at regular expression word boundaries. So if you have a word like “365b1”, that’ll also be listed in the output.

Here’s an example.

Input file contains:

This is text. This is written with intentionally repeated words. This is repeated, intentionally, to produce short output. This text — with intentionally repeated words — is written to produce short text output.
This output is text. Intentionally short text.

Output file will contain:

short 			3
this 			5
text 			5
is 				5
repeated 		3
intentionally 	4
to 				2
written 		2
produce 		2
words 			2
output 			3
with 			2

The columns are tab-separated, so you can copy this into a spreadsheet, which should detect it as two columns and paste accordingly. Then you can draw frequency plots or whatever.
If you want, you can write a bit more code to sort the output alphabetically or by frequency count. I didn’t bother.

A Statistical Problem on Laptop Uptimes

Suppose you are on a large university campus. Most students here use laptops, and if you look around, you’d see most of them either working, listening to music or doing something else on their laptops. Suppose now you think of a quick project: listing the uptimes of the laptops (how long they’ve been running). In Windows this is quite simple: under the ‘Performance’ tab in Task Manager, ‘Uptime’ gives the duration for which the laptop has been running. This timer keeps counting from where it left off if you resume from hibernation, as it should, but we shall assume that no such cases occur on our campus.

Now, the campus is huge, and many students are using their laptops. You figure out some methodical way of visiting each student so that there are no over- or under-counting errors. But it takes you quite a while to visit all of them and note down their uptimes, so that if, for example, it took you six hours to collect all the data, then you took the last uptime reading six hours after the first.

If we assume a very large campus where the uptimes of the laptops of different students are completely independent of each other, the questions are the following:

1. On average, is it going to make any difference to the statistics you collected if, instead of taking a long time to go around the campus, you could somehow acquire all of the uptimes at one instant of time?

2. Is the data you collected going to be distributed differently from the maximum uptimes of the student laptops (the durations they run before being shut down)?

If I find the time and energy, I plan to return and solve it in this blog post. (I haven’t solved it yet.)
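Though I have not worked out the answers, here is a sketch in R of how one might simulate the setup to explore both questions empirically (the exponential session lengths are purely an assumption for illustration):

sessions <- rexp(1e5, rate = 1/5)      #full session lengths in hours (assumed)
ends     <- cumsum(sessions)           #one long timeline of back-to-back sessions
t        <- runif(1e4, 0, max(ends))   #random instants at which a laptop is inspected
idx      <- findInterval(t, ends) + 1  #which session each instant falls into
uptimes  <- t - c(0, ends)[idx]        #elapsed time since that session began
#comparing hist(uptimes) with hist(sessions) bears on question 2; repeating the
#exercise with the inspection instants spread over six hours bears on question 1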

Image Appearance Variation across Desktop Viewers and Websites

I’ve been taking photographs, editing them on my laptop and posting them to several websites for a while now, and I’ve noticed variations in the appearance of an image (mostly richness, sharpness and grain) across some common methods of viewing: different image viewers on your computer, setting the image as your wallpaper, or the different websites where you post it. The differences may be subtle (if they were not, it would be a really big deal that wouldn’t have persisted till now), but to photographers, graphic designers and, in general, people whose work or passion is to deal with images, these are still big differences that may often cause problems. I do not know what causes these differences; I only wish to demonstrate them, so that if such subtle differences matter to you as they do to me, you will not waste frustrated hours trying to track the problem down to where it does not lie, such as your photography, editing or design.

So here’s a series of screenshots of the same photograph that I’ve taken, across various viewing methods, showing the variation. I have ordered them so that similar-looking appearances are clumped together. Within brackets are my ratings of how well I think each one reproduced the image.

1. Photoshop (10/10)

PS

In my experience, the best reproduction. Colours are rich and the image appears sharp and without any additional noise.

2. Windows Photo Viewer (10/10)

Windows Photo Viewer

This is the default image viewer in Windows 7 that I use. Pretty much the same as PS. No complaints.

3. Windows Explorer Preview (9/10)

By that, I mean the large image preview that you get in Explorer in Win 7. Here’s a screenshot to explain what I mean:

Windows Preview

The image itself looked like this:

Windows Preview 1

Notice immediately that the image is slightly duller and not as sharp as in PS or Windows Photo Viewer. In its defence, though, this is not even an image viewer we are talking about, and to me it is good enough for a quick preview.

4. Facebook (9/10)

facebook

This is more or less similar, perhaps a tad duller.

5. Flickr (9/10)

flickr

Almost the same as Facebook. The sharpness is returning though.

6. Picasa (6/10)

Picasa

Disappointingly dull. I used to get pretty frustrated when, after working long in PS on a photo to get the richness and clarity I like, I would save the image and see this kind of output in Picasa, my default image viewer. Then I realized that it is a fundamental difference in how they render images, one that I can do nothing about and shouldn’t worry about: no real richness has been lost. Still, for a dedicated image viewer that many people install and use over the default viewers, this is unpardonable on the part of the developers at Google.

7. Windows Desktop Wallpaper (5/10)

Wallpaper

This one’s a disaster. Not only is it even duller than Picasa; the moment you set any image with the slightest rich colour as your desktop background, a lot of grain suddenly appears. At this size on this blog it is perhaps not visible, but you can click the image to view it at full size and compare with the others, where the grain is clearly discernible.

I have often happily set my photographs as the desktop background, only to revert in disappointment because of the grain. The default Windows wallpapers are rigged somehow so that they appear smooth; I think they had absolutely no grain to begin with. But if an image has the slightest grain, it is greatly amplified when set as the desktop background.