Flickr photo view counts: an elementary analysis

I was taking a casual look at the number of views on my flickr photos when I noticed something that shouldn’t seem very surprising: view counts are low for the first few days, then gradually climb to a higher region (around 100 for me). It occurred to me to actually plot the view counts of my photos against the number of days they had been online, to visualize the trend. So I did, using Excel. You just enter the date of upload and the current date into date-formatted cells, and the usual subtraction formula happily gives you the number of days between those dates, so that was pretty easy. Here’s the result from my rather limited dataset of 33 photos:

[Figure: view counts plotted against days online for 33 photos]

A few cautionary notes before drawing any conclusions from this graph. One is that not all photos grab the same attention. Some are better than others, and will stray from a trendline determined simply by the number of days passed, like the highest point in this chart, which is, in my opinion, the best photo I have posted to flickr so far. A graph like this is therefore not expected to show a smooth pattern, because factors other than age affect views: a photo’s quality, how well it was shared and publicized through various social media, and so on. Also, as I slowly gather contacts and people who follow my photostream and watch for my uploads, I expect new uploads to get more attention than uploads of the same quality did in the past.

Even keeping all this in mind, though, there seems to be some degree of rise in this pretty scattered graph. The linear correlation coefficient (although I don’t expect the correlation to be linear) is around 0.39, about a third of the way up from totally random. Extending that observation, if I imagine a statistically averaged trendline over many photos of different qualities and different degrees of online publicity, i.e. if I want to isolate the effect of days passed alone, several properties of such a trendline curve logically come to mind:

  • It shall start from the origin.
  • It shall be monotonically increasing, of course. Photos cannot be unviewed once they’ve been viewed.

Wait, did you fall for that second one? Because I’d be surprised if you didn’t. I fell for it myself, until just some time back, when I relented to humor a tiny splinter in my brain that had been groaning against this argument ever since I thought of it. The groaning stemmed from memories of a related puzzler about ensemble averages that I had encountered in Statistical Mechanics once, and it eventually turned out to be quite legit.

The truth is, there’s actually no reason why that averaged curve should necessarily be monotonically increasing. Why? Well, a point on that curve has a certain x-coordinate, and so corresponds to an average over all photos that are a certain number of days old. Another point, with a different x-coordinate, is an average over a different (and completely disjoint) population of photos. And while the average view count of a fixed set of photos must necessarily go up with time (each view count goes up, so the sum goes up, so the average goes up), nothing can be said when comparing the view counts of two disjoint sets of photos. It might very well be that the photos you posted five years ago have never, in all that time, received the limelight that your now awesomely professional photos have hogged in just a few months. Thus the averaged curve may at times even drop with increasing online age, which, in fact, my scatter plot seems to indicate to some degree.
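Here’s a toy calculation in python that makes this concrete (all numbers completely made up, just to illustrate the mechanism): every individual photo’s count rises steadily with time, yet the cross-section taken today still falls for the older photos.

import math

def rate(u):
    # daily views for a photo uploaded on day u; newer photos earn views
    # faster, standing in for a slowly growing reputation (made-up model)
    return math.exp(0.05 * u)

T = 100  # 'today', in days since the first upload

# a photo that is n days old was uploaded on day T - n, so by today it
# has piled up rate(T - n) * n views; each photo's count only ever
# rises with time, yet this age profile peaks and then falls
for n in (1, 10, 30, 60, 90):
    print(n, round(rate(T - n) * n, 1))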

Thus, while a time series plot with gradually falling y-coordinates (where this coordinate means something good, like views) is almost always bad news, I now know that in this case it is a most enviable sign of growing reputation.

So we must strike out that second property. On to the next:

  • There will be an initial spike in views as the photo is uploaded and the ripples spread through flickr to your contacts, to other pages, and possibly through linked accounts to other sites. This means a higher slope near the beginning, which decays at a rate I can’t say much about at the moment, except that its timescale is probably of the order of a couple of days.
  • In the long run, when these transient effects have decayed away, the only thing that keeps view counts going up is the fixed background rate at which people chance upon your photos on flickr. I don’t know what this rate is. But whatever it is, barring the reputation effect I mentioned, it can be assumed fixed for a flickr profile, unchanging in time. In real life it rises as you gather more contacts, enlarging the audience that can discover your photos by some avenue, and that is in no way a negligible effect; reputation and recognition matter, and are finally what most people on flickr and elsewhere are striving for. But ignoring that effect, the trendline should asymptotically become a straight line with positive slope.
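For concreteness, one simple shape with all these properties (a saturating exponential for the upload spike plus a straight line for the background rate; an illustrative guess on my part, not a fit to any data) looks like this:

import math

def views(t, A=80.0, tau=3.0, r=0.5):
    # A = size of the initial publicity spike, tau = its decay time in
    # days, r = background discovery rate in views per day; all three
    # values are placeholders, not fitted parameters
    return A * (1 - math.exp(-t / tau)) + r * t

print(views(0))    # 0.0: the curve starts at the origin
print(views(3))    # steep early growth while the spike is still decaying
print(views(100))  # ~A + r*t by now: essentially a rising straight line

The initial slope here is A/tau + r while the asymptotic slope is just r, so the early spike and the eventual straight line fall out of the same expression.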

There are several curves that have all these properties, like the familiar parabola. The actual curve that would fit this hypothetical data is unknown at this point, of course. The trendline I fitted to my dataset was a parabola, with no parameters pinned down by hand, all left floating, and it clearly shows the fall towards the end, although I strongly suspect this is a contribution from that outlier high point (that’s where the hump of the curve is). With my hopelessly insufficient data, it’s all pretty arbitrary at this stage:

[Figure: the same scatter plot with a fitted parabolic trendline]

That’s all I wanted to say, and by itself this is not very interesting stuff, but maybe someone will get other interesting ideas from it. Like plotting a reputation growth curve calculated from how far this view count curve falls below the idealized, constant-reputation view count curve, the one that asymptotes to a rising straight line as I mentioned.


Locating Numbers inside Bisected Interval Sequences

I think it was in a real analysis course in the second semester of my first year: the teacher was discussing the nested interval theorem when something in one of his examples struck me, and I thought of this interesting problem. Well, interesting to me.

We pick any fraction. Now we look at the interval [0,1]. We divide it into two halves, [0,0.5] and [0.5,1], and say, ‘the fraction belongs to this half.’ Say the right half. Then we divide the right half into two halves, check again, and say ‘now it’s in the left half’. We continue like this until we hit the number bang in the middle of an interval.

Now that’s not really a problem, but I thought it would be interesting to look at this sequence of ‘left’s and ‘right’s for a chosen fraction. So I wrote a python program for it. Nothing very amusing came out of that. Then I thought of something else: I took evenly spaced fractions in that interval along the horizontal axis, and plotted, on the vertical axis, the fraction of ‘right’s in their respective left-right sequences, using matplotlib. Here is the python source code:

#!/usr/bin/env python
import matplotlib.pyplot as plt

x = []
y = []
c = 0.
while c <= 1.:
    a = 0.           # left end of the current interval
    b = 1.           # right end of the current interval
    dc = c - a       # distance of the fraction from the left end
    d = (b - a) / 2  # half the current interval length
    R = 0            # number of 'right' steps taken so far
    L = 0            # number of 'left' steps taken so far
    while True:
        if dc > d:       # the fraction lies in the right half
            R += 1
            a = a + d
        elif dc < d:     # the fraction lies in the left half
            L += 1
            b = b - d
        else:            # it sits exactly on the midpoint: sequence ends
            break
        d = (b - a) / 2
        dc = c - a
    x.append(c)
    # fraction of 'right's; if c was the very first midpoint (empty
    # sequence), call it 0.5 to avoid dividing by zero
    y.append(float(R) / (R + L) if R + L else 0.5)
    c += 1e-4
plt.xlabel('Fraction')
plt.ylabel("Fraction of 'Right's in sequence")
plt.plot(x, y, marker='.', markerfacecolor='blue', linestyle='None')
plt.show()

This is what I got:

[Figure: fraction of ‘Right’s in sequence vs. fraction, at a spacing of 1e-4]

Now, for example, 0.375 = 0.5 - 0.25 + 0.125. A minus sign means an L, a plus sign means an R, so 0.375 is LR. And 0.625, the fraction the same distance from the right end as 0.375 is from the left, is 0.5 + 0.25 - 0.125, so it’s RL. As you look at fractions equidistant from 0.5 on either side of it, all the R’s and L’s in their sequences get switched. Therefore, the fraction of R’s in one should be the fraction of L’s in the other, i.e. 1 - the fraction of R’s. Thus you expect the graph to be symmetric about the point (0.5, 0.5). (Think about this, no hurry.) What miffed me at this point, therefore, was that the graph didn’t appear to be symmetric with respect to its center point. There’s some fuzzy mess to the left, and some scattered points isolated from the main band, that are not symmetric at all.
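You can check this switching claim without floating point muddying the waters, by running the same bisection in exact rational arithmetic with python’s fractions module. A quick sketch (my own helper, separate from the script above):

from fractions import Fraction

def lr_sequence(c):
    # left/right sequence for a fraction c strictly between 0 and 1, in
    # exact rational arithmetic; note this terminates only for fractions
    # whose denominator is a power of two
    a, b = Fraction(0), Fraction(1)
    seq = ''
    while True:
        mid = (a + b) / 2
        if c == mid:
            return seq    # landed exactly on a midpoint: sequence ends
        elif c > mid:
            seq += 'R'
            a = mid
        else:
            seq += 'L'
            b = mid

print(lr_sequence(Fraction(3, 8)))  # LR
print(lr_sequence(Fraction(5, 8)))  # RL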

Then I ran some tests with fractions whose sequences aren’t supposed to end at all. Like what? Like 0 and 1, say. If you’ve followed the algorithm, you can tell that we can never arrive at a bisection whose midpoint is either 0 or 1, because there’s nothing on one side of those numbers. So 0 should just give me LLLLL… and 1 should give me RRRRR…, never ending. However, guess what I found when I counted the L’s and R’s in their sequences.

0    L: 1074, R: 0.

1    L: 0, R: 54.

So why do the sequences end? That’s fairly simple: it’s the limitation of storing and computing floating point numbers on a computer. Notice that with each step of the sequence we squeeze our number into an interval that halves in length with every iteration. Very soon, our computer (or the interpreter) reaches a point where numbers so finely separated within that tiny interval are no longer distinct to it, so it cannot tell our fraction apart from the midpoint of the interval, and stops.
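You can watch this squeezing hit the floor directly: keep halving 1.0 and count the steps before the result rounds to zero (assuming the usual IEEE 754 double precision floats):

# keep halving 1.0 and count how many steps it survives
d = 1.0
n = 0
while d > 0.0:
    d /= 2
    n += 1
print(n)  # 1075: the last nonzero value of d is 2**-1074

That last nonzero value is exactly the neighborhood where the L count of 1074 for the fraction 0 bottoms out.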

Exactly how big is this error? It is difficult to tell just from the numbers above. One says it should be 1/2^1074, the other says 1/2^54 (which is closer to where I’d put it, owing to other checks I did and don’t want to discuss here). The final result depends on all the calculations done at every step, and therefore on all the floating point errors that accumulate along the way. However, I think the only way the answer could still differ between a fraction and its ‘mirror image’ is if different floating point errors are associated with addition and subtraction, because those two operations are switched between the pair.
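For what it’s worth, both counts sit right next to well-known limits of IEEE 754 double precision, which is presumably what these python floats are: 53 significant bits of mantissa, and 2^-1074 as the smallest positive (subnormal) number. A quick check:

import sys

print(sys.float_info.mant_dig)  # 53: significant bits in a double
print(2.0 ** -1074)             # 5e-324, the smallest positive subnormal
print(2.0 ** -1074 / 2)         # 0.0: one more halving underflows to zero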

Notice, though, that the fraction of R’s for 0 is 0, and for 1 it is 1. The symmetry is preserved. So where is the problem in the plot coming from? Well, we’ve been lucky with these two numbers, because one of the counts is 0 in both cases. Here’s an example of another pair:

0.1    L: 28, R: 26.

0.9    L: 25, R: 27.

In this case, obviously, the symmetry is not maintained, because the second pair is 25, 27 instead of 26, 28. Thus the graph is no longer symmetric about the center point.

Since I was stubborn about getting a symmetric graph, I decided to cut the process off before it reaches the ambiguous stage, that is, to stop while the interval is still comfortably wide, and to plot the graph from the truncated sequences. I finally got a symmetric one when I set the cutoff interval length at the order of 1e-13. For this, instead of elif dc==d, the termination test in the inner loop becomes elif abs(dc-d)<=1e-13. Here is the resulting graph:
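If you’d rather not patch the script by hand, here’s the inner loop pulled out as a small function with the tolerance as a parameter (my refactoring; it does the same thing as the modified loop):

def lr_counts(c, eps=1e-13):
    # count the L and R steps for fraction c, stopping as soon as the
    # midpoint comes within eps of c, i.e. truncating the sequence
    # before floating point ambiguity sets in
    a, b = 0., 1.
    L = R = 0
    while True:
        d = (b - a) / 2  # half the current interval length
        dc = c - a       # distance of c from the left endpoint
        if abs(dc - d) <= eps:
            return L, R
        elif dc > d:
            R += 1
            a += d
        else:
            L += 1
            b -= d

print(lr_counts(0.1))  # the truncated (L, R) pair for 0.1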

[Figure: the truncated-sequence plot, now symmetric about (0.5, 0.5)]

Note, however, that this error tolerance is not something fixed. It depends on the resolution (spacing) of the fractions you run the computation for. In the images you have seen so far, the fractions were multiples of 10^-4. You get a better image with a spacing one order finer, but for that the error tolerance had to be jacked up to 1e-11:

[Figure: the same plot at a spacing of 1e-5, with tolerance 1e-11]

Do you see something really interesting in this graph now, in the way it organizes itself into parallelograms within parallelograms? It’s a highly ordered fractal. I’ve marked them out for clarity:

[Figure: the previous plot with the nested parallelograms marked out]

In other words, the point symmetry repeats on increasingly smaller scales, as it should. The whole bisected nature of the nested intervals is responsible for this. More parallelograms would be revealed if we kept making our resolution finer, and the horizontal extents of these parallelograms simply trace out those nested intervals.

The fraction of rights, however, doesn’t reveal a lot of information. More interesting is how many bisection steps are required before we converge onto a number. For this you need to modify the source code just a bit: in the y.append(...) call, substitute R+L for float(R)/(R+L), and you get this:

[Figure: sequence length vs. fraction]

The black dots are the data points, joined by blue lines for clarity. Again, this should have been symmetric about x=0.5 (about a line this time, not about a point), but it isn’t. Notice also that the short sequence lengths for numbers such as 0.125 or 0.375, like the ones we discussed, don’t even appear; the lowest sequence length we see here is about 35. That’s because the incrementing loop never actually lands on those fractions, although it should. This is computational error again. I can tell because I have poked around a bit. Try out this python snippet, for example:

c = 0.
while c <= 1.:
    c += 1e-2
    if c == .12:  # never true: the running sum drifts just off 0.12
        print(c)
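If you run it, nothing is printed: the accumulated increments of 1e-2 step past 0.12 without ever equalling it exactly, and the same thing happens to fractions like 0.125 in the main loop.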

By the way, one data point, corresponding to the fraction 0, had to be removed from this graph, because its sequence length was very big, 1074, as we saw before.

If you zoom into the middle of this graph a bit, however, you’ll see the kind of symmetry I had been looking for:

[Figure: zoomed-in view of the middle of the sequence length plot]

Do you see why we should have a picture like this? Think about it; it’s not very hard. Meanwhile, you can download a wallpaper I made in Photoshop out of the above graph, because I liked it so much.

[Image: wallpaper made from the graph]

That’ll be it for now. Let me know if you have any ideas or questions about all this.