# R code for multivariate random-walk Metropolis sampling

I couldn’t find a simple R code for random-walk Metropolis sampling (the symmetric-proposal version of Metropolis-Hastings sampling) from a multivariate target distribution in arbitrary dimensions, so I wrote one. This is also my first R code.
It requires the package MASS to sample from the multivariate normal proposal distribution using the mvrnorm function. If you are using R on Windows, you can download the package zip for Windows from the link, and use Packages > Install package(s) from local zip files… from the GUI to install the package.
The reason I couldn’t write the code for a general Metropolis algorithm (i.e. for any arbitrary symmetric proposal distribution, not just normal) or a more general Metropolis-Hastings algorithm (with any arbitrary proposal distribution, symmetric or not) is that generating the proposal point would then require sampling from an arbitrary proposal distribution. This is only easy for a few standard distributions, but hard in general (which is the point of using such algorithms in the first place).

I. Function

The following is the function that does the random-walk Metropolis-Hastings sampling when supplied with the required arguments. Notes on the arguments follow the code.

rwmetro <- function(target, N, x, VCOV, burnin = 0)
{
    require(MASS)   # for mvrnorm, the multivariate normal sampler
    samples <- x
    for (i in 2:(burnin + N))
    {
        prop <- mvrnorm(n = 1, x, VCOV)   # propose a move from a normal centred at the current point
        if (runif(1) < min(1, target(prop)/target(x)))   # Metropolis acceptance rule
            x <- prop
        samples <- rbind(samples, x)
    }
    samples[(burnin + 1):(N + burnin), ]   # drop the burn-in rows
}

II. Arguments

1. target function
The function defining the multivariate target distribution, written as a function of an n-vector, where n is the number of variables on which the distribution is defined. The different variables of your distribution must be written as x[1], x[2] etc.
Example
An example is the following, defining the function f(x,y)=exp(-5*abs(x^2+y^2-1)) in two dimensions:

ring2D <- function(x)    # x is a vector
{
    exp(-5*abs(x[1]^2 + x[2]^2 - 1))
}
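As a quick sanity check on this definition, the density is maximal (value 1) exactly on the unit circle and falls off away from it:

```r
ring2D <- function(x)    # same definition as above, repeated so this snippet runs on its own
{
    exp(-5*abs(x[1]^2 + x[2]^2 - 1))
}

ring2D(c(1, 0))    # on the ring: exp(0) = 1
ring2D(c(0, 0))    # at the origin: exp(-5), about 0.0067
```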

2. N integer
The final sample size (i.e., excluding the burn-in length).
Example
100
3. x numeric vector
The starting point (vector) of your Metropolis-Hastings chain. This vector needs to be the same length as the dimension of the target distribution. Your target function must be able to accept this vector as an argument.
Example
c(-0.1,20.5)
4. VCOV numeric matrix
The variance-covariance matrix of the multivariate normal that is used as the proposal distribution for random-walk Metropolis-Hastings. This must be an n×n matrix, where n is the dimension of the target distribution, i.e. the length of vectors accepted by target and the length of x. You can vary the entries of this matrix and observe your results to see what works better for sampling your target distribution.
Example
The following line defines a matrix in two dimensions with .01 variance for each variable and no covariance between them.

vcov2D <- .01*diag(2)

5. burnin (optional) integer
The ‘burn-in’ length for the chain. The number specified is the number of initial samples discarded. It defaults to 0.
Example
20

III. Output
numeric matrix
The output is a matrix where each row is a sample from your target distribution, excluding the initial burn-in samples. The number of rows is thus the sample size, and the number of columns is equal to the dimension of the target distribution. You can use this matrix however you want, to save, visualize or analyze your sample.
Example
An example output of sampling a 2D distribution with sample size 5. Each row is an x,y sample.

[1,] 0.12961923 0.03708061
[2,] 0.10765053 -0.02798036
[3,] 0.01112930 -0.07255766
[4,] 0.06049625 -0.04546265
[5,] 0.1832164 -0.1244354
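A sketch of what you can do with the returned matrix (the five example rows above are typed back in here so the snippet stands alone):

```r
s <- matrix(c( 0.12961923,  0.03708061,
               0.10765053, -0.02798036,
               0.01112930, -0.07255766,
               0.06049625, -0.04546265,
               0.18321640, -0.12443540),
            ncol = 2, byrow = TRUE)

dim(s)        # 5 rows (samples) by 2 columns (dimensions)
colMeans(s)   # componentwise sample means
```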

IV. Usage Example
Let’s take the target distribution we used as an example above, f(x,y)=exp(-5*abs(x^2+y^2-1)). This looks like a ring of radius 1 rising from the x-y plane and centered at the origin. Here is a gnuplot surface plot of the distribution (because I found it frustratingly hard to figure out a level plot in R):

Let’s generate a sample of size 40,000 from this distribution with the starting point (0,0) and without any burn-in length, and with the variance-covariance matrix we defined before. This is done by calling the function with the correct arguments:

ringsample2D <- rwmetro(ring2D,40000,c(0,0), vcov2D)

This assumes that you have already defined the target ring2D and the matrix vcov2D in the way explained, preceding this function call.
ringsample2D now contains a random sample from the distribution.
With the following one-line R code, I made a scatter plot of the sample from the ringsample2D matrix.

plot(ringsample2D[,1], ringsample2D[,2], xlim=c(-1.5,1.5),ylim=c(-1.5,1.5), main='Metropolis-Hastings Sample',xlab='x', ylab='y', pch='.')

The following is the plot:

Putting all the code together in sequence, this is what the full code for defining the arguments, drawing the sample, and making the plot for this example looks like:

# Define arguments
ring2D <- function(x)    # x is a vector
{
    exp(-5*abs(x[1]^2 + x[2]^2 - 1))
}

vcov2D <- .01*diag(2)

#Define the sampling function
rwmetro <- function(target, N, x, VCOV, burnin = 0)
{
    require(MASS)   # for mvrnorm, the multivariate normal sampler
    samples <- x
    for (i in 2:(burnin + N))
    {
        prop <- mvrnorm(n = 1, x, VCOV)   # propose a move from a normal centred at the current point
        if (runif(1) < min(1, target(prop)/target(x)))   # Metropolis acceptance rule
            x <- prop
        samples <- rbind(samples, x)
    }
    samples[(burnin + 1):(N + burnin), ]   # drop the burn-in rows
}

# Call the sampling function with the arguments
ringsample2D <- rwmetro(ring2D,40000,c(0,0),vcov2D)

# Use the sample
plot(ringsample2D[,1], ringsample2D[,2], xlim=c(-1.5,1.5),ylim=c(-1.5,1.5), main='Metropolis-Hastings Sample',xlab='x', ylab='y', pch='.')

The following is another sample, this time for a 3-variable function, f(x,y,z)=exp(-5*abs(x^2+y^2+z^2-1)). This is just the 3D version of the previous function, and its density is mostly concentrated on a spherical shell of radius 1 and of some thickness, centered at (0,0,0). Notice how we never have to spare a thought about normalizing these functions to get the sample, which is one of the advantages of the Metropolis-Hastings algorithm.
The target function is now:

ring3D <- function(x)    # x is a vector
{
    exp(-5*abs(x[1]^2 + x[2]^2 + x[3]^2 - 1))
}

and the variance-covariance matrix is, say, .01 down the diagonal again:

vcov3D <- .01*diag(3)

Let’s go for a sample of 20000 points after a burn-in of 20000, starting from (0,0,0). The function is then called as:

ringsample3D <- rwmetro(ring3D,20000,c(0,0,0),vcov3D,20000)

ringsample3D is now a 20000×3 matrix, and I use the following to get a rotatable 3D plot of it (requires the rgl package):

require(rgl)
plot3d(ringsample3D[,1], ringsample3D[,2], ringsample3D[,3], xlab='x',ylab='y', zlab='z',col='red', size=1)

I would have preferred a spinning 3D gif, but I’m too tired to work that out right now.
Share the knowledge.

# Partners meet halfway: a simple correlation study of an undergrad lab class

Last semester I taught two sections of an Introductory Physics (Electrodynamics) for Engineers lab course at the University of Texas at Austin. On the first day, most of my students did not know each other. They came in and sat at tables of two, and these pretty much became their permanent seats in the room for the rest of the course (such is human habit). They were supposed to work with their partner on the experiments in the first part of each class, but individually on their worksheets in the second part. So choosing their seats also fixed the partners. It is therefore a fair assumption that the pairing was essentially random, at least as far as the academic ability of the students was concerned.

I noticed though that both partners in a group tended to move towards a common average in their performance over the course. If one partner was diligent, it would motivate (and sometimes force) the other to work harder, while the sloppiness and uncaring attitude of a partner would also negatively affect the performance of the other. I was happy to notice though that the ‘uplifting’ effect was usually more dominant than the ‘down-dragging’. In any case, this caused the performance of partners to pull closer, irrespective of their starting academic ability.

At the end of the semester, after I had the final scores that I handed out to them, I decided to do a simple correlation test. The following are two scatter plots of pairs of student scores for each of my classes, each pair ordered as (better score, worse score) (which is why all points are below the bottom-left to top-right diagonal of the plot). The linear fits through them have also been shown. Although I had suspected that there would be a significant correlation between partners’ scores, I was surprised at how high the correlation coefficients turned out to be. (Note that if you pair an even number of numbers randomly, on the average you should expect a linear correlation coefficient of zero.)
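That parenthetical claim is easy to check by simulation (the scores below are just uniform random numbers, not my students' data): pair a fresh batch of hypothetical scores at random many times and average the within-pair correlation.

```r
set.seed(1)
avg_cor <- mean(replicate(2000, {
    scores <- runif(20)                           # 20 hypothetical student scores
    paired <- matrix(sample(scores), ncol = 2)    # pair them off at random
    cor(paired[, 1], paired[, 2])                 # correlation within the random pairs
}))
avg_cor    # close to 0
```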

I had sensed this trend even by the middle of the semester, and I had the crazy idea of using it to try to pull up the performance of the students who were not doing too well. I thought of sorting the students by their score so far in the course, and re-assigning partners so as to pair the best student with the worst, the second-best with the second-worst, and so on. But habit is a deadly force: if I told my students mid-course that they had to say goodbye to a partner with whom they already had some steady chemistry going, and take on new partners that I had chosen for them, all hell would break loose. So I had to let it rest as a theoretical result with possible practical applications, to be taken up by a braver soul than me. But hey, here’s the data, so you can quote it if you want to try it and someone doesn’t take the idea too well.

# A/B and Rh Antigens in Blood Types: A Statistical Test of Independence among IISER Kolkata Students

A couple of days ago, one of my juniors in college (Indian Institute of Science Education and Research, Kolkata) found a spreadsheet left unguarded in the guest account of our computing system.

This spreadsheet contained the blood types of the Masters students across five batches (‘07-‘13). With this information he made a nice little bar diagram of the frequency distribution of blood types, which I lift here:

When I looked at this graph, what I noticed first was that this contains the distribution across both the antigen type (A/B/AB/O) and the Rh factor (+/-). It struck me that it may be possible by constructing a contingency table to check whether these two properties are independent of each other.

For this I first constructed from the graph the two-way contingency table of the joint distribution of antigen and Rh factor:

| ↓Rh \ Ag→ | A | B | AB | O | Marginal totals |
|---|---|---|---|---|---|
| + | 76 | 116 | 28 | 122 | 342 |
| − | 1 | 4 | 1 | 6 | 12 |
| Marginal totals | 77 | 120 | 29 | 128 | 354 (total) |

Consider now the marginal distributions, i.e. the subtotals along the last row and column. If we divide these numbers through by the total number of data points (354), we obtain the relative marginal frequencies. For a large sample, this may be identified with the probabilities of occurrence of the individual properties irrespective of that of the other.

Now comes the important part. If the occurrence of these two properties are independent of each other, then the joint probability in any cell shall be the product of the marginal probabilities for that row and column (P(A∩B)=P(A)P(B)). In terms of frequencies, this means that the number in any cell should be the product of the subtotals for that row and column, divided by the total (354). You can think of it this way: the frequency in any cell is the product of one of its marginal frequencies with the other relative marginal frequency (signifying the independent conditional probability). Example:

n(A+) = n(A)·P(+) = n(A)·n(+)/N.

Taking this assumption of independence then, it is possible to construct a contingency table inwards starting from just the outer marginal frequencies. This has been constructed below with the marginal frequencies of the actual data, and rounded to the nearest integers.

| ↓Rh \ Ag→ | A | B | AB | O | Marginal totals |
|---|---|---|---|---|---|
| + | 74 | 116 | 28 | 124 | 342 |
| − | 3 | 4 | 1 | 4 | 12 |
| Marginal totals | 77 | 120 | 29 | 128 | 354 (total) |
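The expected table under independence can be reproduced in a few lines of R from the observed counts: the outer product of the marginal totals, divided by the grand total.

```r
obs <- matrix(c(76, 116, 28, 122,
                 1,   4,  1,   6),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("+", "-"), c("A", "B", "AB", "O")))

expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)   # n(row) * n(col) / N in every cell
round(expected)   # matches the table above
```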

If you compare the joint frequencies in the two tables, you can see immediately that they are very close, lending support to the assumption that these factors are independent.

For a more graphical idea, I decided to plot cluster bars of the actual and the computed frequencies assuming independence.

The average error between the actual and computed frequencies was 0 over the eight blood types. That doesn’t say much because errors of opposite signs reduce the net effect. The root mean squared error was 1.16, which too is tiny in comparison to the frequencies themselves.
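Both error figures come straight out of the two tables (the observed counts are repeated here so the snippet stands alone):

```r
obs <- matrix(c(76, 116, 28, 122,
                 1,   4,  1,   6), nrow = 2, byrow = TRUE)
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

mean(obs - expected)            # exactly 0: matching marginals force the signed errors to cancel
sqrt(mean((obs - expected)^2))  # about 1.16
```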

Thus, it is pretty safe to suppose that the antigen and the Rh factor are independent properties.

There are stronger, more explicit tests of association than what has been done here. The usual correlation coefficient, though, cannot be calculated here because the variable (blood type) is not a quantitative one. However, there are others that you can read about here.
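One such standard test is Pearson's chi-squared test of independence, built into base R as chisq.test. (The expected counts in the − row are small, which chisq.test would warn about; simulating the p-value sidesteps that. A large p-value is consistent with independence.)

```r
obs <- matrix(c(76, 116, 28, 122,
                 1,   4,  1,   6), nrow = 2, byrow = TRUE)

set.seed(7)
chisq.test(obs, simulate.p.value = TRUE, B = 10000)   # large p-value: no evidence against independence
```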

If you want to test this method or any of the other methods on other datasets of blood types, this page provides quite a bit of data for various countries (although last I checked, they seemed to not be completely factually correct).

# A Statistical Problem on Laptop Uptimes

Suppose you are in a large university campus. Most students here use laptops, and if you look around, you’d see most of them either working, listening to music or doing something else on their laptops. Suppose now you think of a quick project, of listing the uptimes of the laptops (how long they’ve been running). In Windows this is quite simple. Under the ‘Performance’ tab in Task Manager, you’ll notice that ‘Uptime’ gives the duration for which the laptop has been running. This timer, however, keeps counting from where it left off if you resume from hibernation, as it should, but we shall assume that no such cases happen in our campus.

Now, the campus is huge, and many students are using their laptops. You figure out some methodical way of visiting every student so that there is no over- or under-counting. But it takes you quite a while to visit all of them and note down their uptimes. So if, for example, it took you six hours to collect all the data, then you took the last uptime reading six hours after the first.

If we assume a very large campus where the uptimes of the laptops of different students are completely independent of each other, the questions are the following:

1. On average, is it going to make any difference to the statistics you collected if, instead of taking a long time to go around the campus, you could somehow acquire all of the uptimes at one instant of time?

2. Is the data you collected going to be distributed differently from that of the maximum uptimes of student laptops (duration before they shut it down)?

If I can find the time and energy, I plan to return and solve it in this blog post. (I haven’t solved it yet.)

# Flickr photo view counts: an elementary analysis

I was taking a casual look at the number of views on my flickr photos when I noticed something that should not be very surprising: view counts are low for the first few days, then gradually grow to a higher region (around 100 for me). The idea came to me to plot the view counts of the photos against the number of days they had been online, to visualize the trend. So I did, using Excel. You just need to enter the date of upload and the current date into date-formatted cells, and the usual subtraction command happily gives you the number of days between those dates, so that was pretty easy. Here’s the result from my rather limited dataset of 33 photos:

There are some caveats before drawing any conclusion from this graph. One is that not all photos grab the same attention. Some are better than others, and will deviate from a trendline determined simply by the number of days passed, like the highest point in this chart, which is, in my opinion, the best photo I have posted to flickr so far. A graph like this is therefore not expected to show a smooth pattern, because factors other than age affect views: the photo's quality, and how well it was shared and publicized through various social media, etc. Also, as I slowly gather contacts and people who follow my photostream and watch for my uploads, I expect new uploads to get more attention than uploads of the same quality did in the past.

Even keeping all these in mind, though, there seems to be some degree of rise in this pretty scattered graph. The linear correlation coefficient (although I don’t expect the correlation to be linear) is around 0.39. That’s about a third of the way upwards from totally random. Extending that observation, if I imagine a statistically averaged trendline over many photos of different qualities and different degrees of online publicity, i.e. I want to think only of the effect of days passed, several properties of such a trendline curve logically come to my mind:

• It shall start from the origin.
• It shall be monotonically increasing, of course. Photos cannot be unviewed once they’ve been viewed.

Wait, did you fall for that second one? Because I’d be surprised if you didn’t. I fell for that myself, until just some time back when I relented to humor a tiny splinter in my brain that was groaning against this argument ever since I thought about it. The groaning ensued from memories of a related puzzler in ensemble averages that I had encountered in Statistical Mechanics once, and eventually turned out to be quite legit.

The truth is, there’s actually no reason why that averaged curve should necessarily be monotonically increasing. Why? Well, a point on that curve has a certain x-coordinate, and so corresponds to the average over all photos that are a certain number of days old. Another point, with a different x-coordinate, is an average over a different (and completely mutually disjoint) population of photos. And while the average view count of a fixed set of photos must necessarily go up with time (each view count goes up, so sum goes up, so average goes up), nothing can be said of a comparison between view counts of two disjoint sets of photos at different points of time. It might very well be that the photos you posted five years ago have never received the limelight so long that your now awesomely professional photos have hogged in just a few months. Thus, the averaged curve may at times even drop with increasing online age. Which, in fact, my scatter plot seems to indicate to some degree.
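A toy numerical instance of the argument (all numbers invented): suppose your old photos accrued views at 2 per day, while your newer, better-publicized photos accrue 10 per day.

```r
old_views <- 200 * 2     # a 200-day-old photo at 2 views/day
new_views <-  50 * 10    # a 50-day-old photo at 10 views/day
old_views < new_views    # TRUE: here the cross-sectional average drops with age
```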

Thus, while a time series plot with gradually falling y-coordinates (where this coordinate means something good, like views) is in almost all cases bad news, now I know that in this case it is a most enviable sign of growing reputation.

So we must strike out that second property. On to the next:

• There will be an initial spike in views as the photo is uploaded and the ripples spread through flickr to your contacts, to other pages, and possibly through linked accounts to other sites. This means a higher slope near the beginning, which decays eventually at a rate that I don’t know anything about at the moment, except that it will probably be of the order of a couple of days.
• In the long run, when these transient effects have decayed out, the only thing that keeps view counts going up is the fixed background rate at which people chance upon your photos on flickr. I don’t know what this rate is. But whatever it is, barring the reputation effect I mentioned, it can be assumed to be fixed for a flickr profile, unchanging in time. In real life though, it rises when you gather more contacts, increasing the audience that can discover your photos by some avenue. It’s in no way a negligible effect. Reputation and recognition matter. In fact, that’s finally what most people on flickr and elsewhere are striving for. But ignoring that effect, asymptotically the trendline should become a straight line with positive slope.

There are several curves that have all these properties, like the familiar parabola. The actual curve that will fit this hypothetical data is unknown at this point, of course. The trendline I fitted with my dataset was a parabola, with no manipulated parameters, all floating, and it clearly shows the fall towards the end. Although I strongly suspect this could be a contribution from that outlier high point (that’s where the hump of the curve is). With my hopelessly insufficient data, this is all pretty arbitrary at this stage:
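For the record, a parabola with all parameters left floating can also be fitted in R (the original fit was done in Excel; the data below are invented placeholders, not my actual view counts):

```r
set.seed(42)
days  <- sort(sample(5:400, 33))                  # hypothetical ages of 33 photos, in days
views <- 20 + 0.5*days - 0.0008*days^2 +          # a made-up humped trend
         rnorm(33, sd = 15)                       # plus noise

fit <- lm(views ~ days + I(days^2))               # quadratic trendline, no fixed parameters
coef(fit)                                         # intercept, linear and quadratic coefficients
```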

That’s all I wanted to say, and by itself this is not very interesting stuff, but maybe someone will get some other interesting ideas from this. Like maybe plotting a reputation growth curve calculated from the departure (fall) of this view count curve as compared to the idealized, constant-reputation view count curve which asymptotes to a rising straight line as I mentioned.