Python Script to Generate Frequency Counts of Words in a Text

The following Python script takes a text file as input and writes an unsorted list of word frequency counts to an output text file. It’s pretty simple and short, and uses only Python’s regular expressions module, re, which is part of the standard library, so the script will run on any system with a standard Python installation.

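from re import compile

# Read the input file, lowercase it, and extract every 'word' (word characters plus apostrophes)
with open(input('Input file: '), 'r') as infile:
    words = compile(r"(\w[\w']*)").findall(infile.read().lower())

# Write one tab-separated "word, count" line per distinct word, in arbitrary order
with open(input('Output file: '), 'w') as outfile:
    for word in set(words):
        print(word, words.count(word), sep='\t', file=outfile)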

Note that ‘words’ here doesn’t mean dictionary words (with such a small script it’s not possible to check against dictionary words). Instead, ‘words’ are whatever the regular expression matches: runs of letters, digits and underscores, possibly containing apostrophes. So if you have a word like “365b1”, that’ll also be listed in the output.

Here’s an example.

Input file contains:

This is text. This is written with intentionally repeated words. This is repeated, intentionally, to produce short output. This text — with intentionally repeated words — is written to produce short text output.
This output is text. Intentionally short text.

Output file will contain:

short 			3
this 			5
text 			5
is 				5
repeated 		3
intentionally 	4
to 				2
written 		2
produce 		2
words 			2
output 			3
with 			2

The columns are tab-separated, so you can copy this into spreadsheets which should detect this as two columns and paste accordingly. Then you can draw frequency plots or whatever.
If you want, you can write a bit more code to sort the output alphabetically or by frequency count. I didn’t bother.
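
For anyone who does want sorted output, here is a minimal sketch of one way to do it. It is not part of the script above: it assumes Python 3, it brings in collections.Counter from the standard library (so it no longer relies on re alone), and the function names word_counts and format_counts are made up for illustration.

```python
import re
from collections import Counter

WORD = re.compile(r"\w[\w']*")  # word characters, optionally with apostrophes


def word_counts(text):
    """Count the lowercase 'words' found in text."""
    return Counter(WORD.findall(text.lower()))


def format_counts(counts, by_frequency=True):
    """Return tab-separated 'word<TAB>count' lines in sorted order."""
    if by_frequency:
        # Highest count first; ties are broken alphabetically.
        items = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    else:
        # Plain alphabetical order.
        items = sorted(counts.items())
    return "\n".join(f"{word}\t{count}" for word, count in items)


sample = "This is text. This is short text."
print(format_counts(word_counts(sample)))
```

Passing by_frequency=False gives the alphabetical listing instead. Counter also makes a single pass over the word list, which matters for long texts.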

A/B and Rh Antigens in Blood Types: A Statistical Test of Independence among IISER Kolkata Students

A couple of days ago one of my juniors in college (Indian Institute of Science Education and Research, Kolkata) found a spreadsheet lying unguarded in the guest account of our computing system.

This spreadsheet contained the blood types of the Masters students across five batches (‘07-‘13). With this information he made a nice little bar diagram of the frequency distribution of blood types, which I lift here:

[Figure: bar diagram of the frequency distribution of blood types]

When I looked at this graph, the first thing I noticed was that it shows the distribution across both the antigen type (A/B/AB/O) and the Rh factor (+/-). It struck me that by constructing a contingency table it might be possible to check whether these two properties are independent of each other.

For this I first constructed from the graph the two-way contingency table of the joint distribution of antigen and Rh factor:

Rh \ Antigen        A     B    AB     O   Marginal totals
+                  76   116    28   122   342
-                   1     4     1     6    12
Marginal totals    77   120    29   128   354 (total)

Consider now the marginal distributions, i.e. the subtotals along the last row and column. If we divide these numbers by the total number of data points (354), we obtain the relative marginal frequencies. For a large sample, these may be identified with the probabilities of occurrence of each property irrespective of the other.

Now comes the important part. If the occurrences of these two properties are independent of each other, then the joint probability in any cell is the product of the marginal probabilities for that row and column (for independent events X and Y, P(X∩Y)=P(X)P(Y)). In terms of frequencies, this means that the number in any cell should be the product of the subtotals for that row and column, divided by the total (354). You can think of it this way: the frequency in any cell is one of its marginal frequencies multiplied by the other relative marginal frequency (which, under independence, equals the conditional probability). Example:

n(A+) = n(A) × P(+) = n(A) × n(+) / N = 77 × 342 / 354 ≈ 74.4.

Taking this assumption of independence then, it is possible to construct a contingency table inwards starting from just the outer marginal frequencies. This has been constructed below with the marginal frequencies of the actual data, and rounded to the nearest integers.

Rh \ Antigen        A     B    AB     O   Marginal totals
+                  74   116    28   124   342
-                   3     4     1     4    12
Marginal totals    77   120    29   128   354 (total)
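
As a rough sketch of this calculation (in Python 3; the variable and function names are only illustrative), the table above can be rebuilt from the marginal totals of the observed counts alone:

```python
# Observed counts read off the bar diagram.
# Rows are Rh (+, -); columns are antigen types A, B, AB, O.
observed = [
    [76, 116, 28, 122],
    [1, 4, 1, 6],
]


def expected_counts(table):
    """Expected cell counts if the row and column properties are independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    # Each cell is (row total) x (column total) / (grand total).
    return [[r * c / total for c in col_totals] for r in row_totals]


expected = expected_counts(observed)
for label, row in zip("+-", expected):
    print(label, [round(x) for x in row])
```

The unrounded values are kept in expected, which is useful if you want to measure how far the actual counts are from them.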

If you compare the joint frequencies in the two tables, you can see immediately that they are very close, lending support to the assumption that these factors are independent.

For a more graphical idea, I decided to plot cluster bars of the actual and the computed frequencies assuming independence.

[Figure: cluster bar plot of the actual and computed frequencies]

The average error between the actual and computed frequencies was 0 over the eight blood types. That doesn’t say much, because errors of opposite signs cancel; in fact they must cancel here, since both tables share the same marginal totals. The root mean squared error was 1.16, which too is tiny in comparison to most of the frequencies themselves.
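
Both error figures can be checked with a few lines of Python 3. This sketch assumes the errors are measured against the unrounded computed frequencies, which is what gives 1.16; measured against the rounded table, the root mean squared error comes out nearer 1.41. The variable names are illustrative.

```python
from math import sqrt

# Rows are Rh (+, -); columns are antigen types A, B, AB, O.
observed = [[76, 116, 28, 122], [1, 4, 1, 6]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Observed count minus the (unrounded) count expected under independence.
errors = [
    observed[i][j] - row_totals[i] * col_totals[j] / total
    for i in range(len(row_totals))
    for j in range(len(col_totals))
]

mean_error = sum(errors) / len(errors)
rmse = sqrt(sum(e * e for e in errors) / len(errors))
print(f"mean error = {mean_error:.2f}, RMS error = {rmse:.2f}")
```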

Thus, it is pretty safe to suppose that the antigen and the Rh factor are independent properties.

There are stronger, more explicit tests of association than what has been done here. The usual correlation coefficient, though, cannot be calculated because the variables (antigen type and Rh factor) are not quantitative. However, there are others that you can read about here.
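
One commonly used test of this kind is Pearson’s chi-squared test of independence. It is offered here only as an illustration and is not necessarily one of the methods on the linked page. The sketch below is plain Python 3 with illustrative names; the 5% critical value for 3 degrees of freedom (7.815) is hard-coded from standard tables, and the choice of the 5% level is an assumption. Since the expected counts in the Rh- row are below 5, the chi-squared approximation is only rough for this table, so the verdict is indicative at best.

```python
# Rows are Rh (+, -); columns are antigen types A, B, AB, O.
observed = [[76, 116, 28, 122], [1, 4, 1, 6]]


def chi_squared(table):
    """Pearson's chi-squared statistic and degrees of freedom for a two-way table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, count in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (count - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, dof


# From standard chi-squared tables: 5% level, 3 degrees of freedom (assumed level).
CRITICAL_5_PERCENT_DOF_3 = 7.815

statistic, dof = chi_squared(observed)
print(f"chi-squared = {statistic:.2f} with {dof} degrees of freedom")
if statistic < CRITICAL_5_PERCENT_DOF_3:
    print("No evidence against independence at the 5% level.")
else:
    print("Independence is rejected at the 5% level.")
```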

If you want to test this method or any of the other methods on other datasets of blood types, this page provides quite a bit of data for various countries (although when I last checked, the figures did not seem to be entirely accurate).

A Statistical Problem on Laptop Uptimes

Suppose you are on a large university campus. Most students here use laptops, and if you look around, you’d see most of them working, listening to music or doing something else on their laptops. Suppose now you think of a quick project: listing the uptimes of the laptops (how long they’ve been running). In Windows this is quite simple. Under the ‘Performance’ tab in Task Manager, ‘Uptime’ gives the duration for which the laptop has been running. This timer keeps counting from where it left off when you resume from hibernation, as it should, but we shall assume that no such cases happen on our campus.

Now, the campus is huge, and many students are using their laptops. You figure out some methodical way of visiting each student so that nobody is counted twice or missed. But it takes you quite a while to visit all of them and note down their uptimes. If, for example, it took you six hours to collect all the data, then you took the last uptime reading six hours after the first.

If we assume a very large campus where the uptimes of the laptops of different students are completely independent of each other, the questions are the following:

1. On average, would it make any difference to the statistics you collected if, instead of taking a long time to go around the campus, you could somehow acquire all of the uptimes at one instant?

2. Is the data you collected going to be distributed differently from the maximum uptimes of the student laptops (the durations for which they run before being shut down)?
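
These questions can at least be explored numerically before attempting an analytical answer. The sketch below is a toy Monte Carlo model in Python 3, built on assumptions that are not part of the problem statement: every laptop is switched back on the moment it is shut down, session lengths are drawn uniformly between 0 and 10 hours, and the survey visits the laptops at evenly spaced times. The function names and parameters are illustrative only.

```python
import random


def observe(visit_time, rng, max_session=10.0):
    """Return (uptime seen at visit_time, full length of that session) for one laptop."""
    start = 0.0
    while True:
        length = rng.uniform(0.0, max_session)  # assumed session-length distribution
        if start + length > visit_time:
            return visit_time - start, length
        start += length  # assumption: the laptop is restarted immediately


def survey(n_laptops, duration, rng, warmup=200.0):
    """Visit n_laptops at evenly spaced times over `duration` hours, after a warm-up."""
    readings = []
    for k in range(n_laptops):
        visit_time = warmup + duration * k / n_laptops
        readings.append(observe(visit_time, rng))
    return readings


def mean(values):
    return sum(values) / len(values)


rng = random.Random(0)
instant = survey(5000, 0.0, rng)  # every uptime read at the same instant
slow = survey(5000, 6.0, rng)     # readings spread over six hours
sessions = [rng.uniform(0.0, 10.0) for _ in range(5000)]  # uptimes at shutdown

print("mean uptime, instantaneous survey:", round(mean([u for u, _ in instant]), 2))
print("mean uptime, six-hour survey:     ", round(mean([u for u, _ in slow]), 2))
print("mean full session length:         ", round(mean(sessions), 2))
```

Changing the assumed session-length distribution or the survey duration and running it again gives a feel for which of the two questions is sensitive to them. It is an experiment under the stated assumptions, not a solution.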

Unless I find myself without the time and effort, I plan to return and solve it in this blog post. (I haven’t solved it yet.)

Image Appearance Variation across Desktop Viewers and Websites

I’ve been taking photographs, editing them on my laptop and posting them to several websites for a while now, and I’ve noticed that the appearance of an image, mostly its richness, sharpness and grain, varies among some common methods of viewing it. These may be different image viewers on your computer, setting the image as your wallpaper, or different websites where you post the image. The differences may be subtle (if they were not, it would be a really big deal that wouldn’t have persisted till now), but to photographers, graphic designers and in general to people whose work or passion is to deal with images, I think they are still big enough to cause problems often. I do not know what causes these differences; I only wish to demonstrate them so that, in case such subtle differences matter to you as they do to me, you will not waste frustrated hours trying to track the problem down to where it does not lie, such as your photography, editing or design.

So here’s a series of screenshots of the same photograph that I’ve taken, across various viewing methods, showing the variation. I have ordered them so that similar-looking appearances are clumped together. Within brackets are my ratings for how well I think they reproduced the image.

1. Photoshop (10/10)

[Screenshot: Photoshop]

In my experience, the best reproduction. Colours are rich and the image appears sharp and without any additional noise.

2. Windows Photo Viewer (10/10)

[Screenshot: Windows Photo Viewer]

This is the default image viewer in Windows 7, which I use. Pretty much the same as PS. No complaints.

3. Windows Explorer Preview (9/10)

By that, I mean the large image preview that you get in Explorer in Windows 7. Here’s a screenshot to explain what I mean:

[Screenshot: Windows Explorer showing the large image preview]

The image itself looked like this:

[Screenshot: the photograph as shown in the Explorer preview]

Notice immediately that the image is slightly duller and not as sharp as in PS or Windows Photo Viewer. In its defence though, this is not even an image viewer we are talking about, and to me this is good enough for a quick preview.

4. Facebook (9/10)

[Screenshot: Facebook]

This is more or less similar to the Explorer preview, perhaps a tad duller.

5. Flickr (9/10)

[Screenshot: Flickr]

Almost the same as Facebook. The sharpness is returning though.

6. Picasa (6/10)

[Screenshot: Picasa]

Disappointingly dull. I used to get pretty frustrated when, after working long in PS on a photo to get the richness and clarity I like, I would save the image and see this kind of output in Picasa, my default image viewer. Then I realized that it is a fundamental difference in how the two programs reproduce images, one that I can do nothing about and shouldn’t worry about. No real richness has been lost. However, this being a dedicated image viewer that many people install and use over the default viewers, it is unpardonable on the part of the developers at Google.

7. Windows Desktop Wallpaper (5/10)

[Screenshot: Windows desktop wallpaper]

This one’s a disaster. Not only is it even duller than Picasa, the moment you set any image with the slightest of rich colour as your desktop background, there suddenly arises a lot of grain. At this size on this blog it is perhaps not visible, but you can click the image to view the full size and compare with the others, when the grain will be clearly discernible.

I have often happily set my photographs as the desktop background, only to disappointedly revert because of the grain. The default Windows wallpapers are rigged somehow so that they appear smooth. I think they had absolutely no grain to begin with. But if an image has the slightest grain, it is greatly amplified upon setting as the desktop background.

I am here.

The spirit of the universe was nowhere to be found
and we had charted light years of starways
through quiet dust-strewn blankets of forever dark
to find ourselves run out of road,
staring but into ancient void

There is nothing, we relayed
as we made to pull the plug
for there was nothing to stay on for

When the stardust-clad darkness gazed into us
as does a cliff into the mountaineer
breathed into our forgotten souls
and said,
I am here.

And we stood and watched,
explorers floating in deep space calm,
until I can remember no more.
