Python Script to Generate Frequency Counts of Words in a Text

The following Python script takes a text file as input and writes an unsorted list of word frequency counts to an output text file. It’s short and simple, and uses only Python’s regular expressions module re, which is part of the standard library, so it needs nothing beyond a standard Python installation.

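# Written for Python 3. Assumes the input file is UTF-8 encoded.
import re

# A word: word characters, commas, periods and apostrophes, ending in a word character.
pattern = re.compile(r"[\w,.'\u2019]*\w")
with open(input('Input file: '), encoding='utf-8') as infile:
    words = pattern.findall(infile.read().lower())
with open(input('Output file: '), 'w', encoding='utf-8') as outfile:
    for word in set(words):
        # One line per distinct word: the word, a tab, then its count.
        outfile.write('%s\t%d\n' % (word, words.count(word)))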

Note that ‘words’ here doesn’t mean dictionary words (with such a small script it’s not possible to check against a dictionary). Instead, a ‘word’ is whatever the regular expression matches: a run of letters, digits and underscores that may also contain commas, periods and apostrophes, as long as it ends in a letter, digit or underscore. So if the text contains a token like “365b1”, that’ll also be listed in the output.
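To get a feel for what counts as a ‘word’, here is a small Python 3 sketch that applies the same kind of pattern to a made-up sentence. The sample text and the variable names are mine, purely for illustration, and writing the curly apostrophe as \u2019 assumes the text is handled as Unicode.

```python
import re

# Same idea as the pattern in the script: a run of word characters, commas,
# periods and apostrophes that must end in a word character.
# Assumption: the curly apostrophe is written as \u2019 (Unicode text).
word_pattern = re.compile(r"[\w,.'\u2019]*\w")

# Hypothetical sample text, not taken from any real input file.
sample = "Don't panic: version 3.14 of 365b1 costs 1,000 rupees, roughly."
tokens = word_pattern.findall(sample.lower())
print(tokens)
```

Tokens such as “don't”, “3.14” and “1,000” stay in one piece because the punctuation sits inside them, while the trailing comma and period are dropped because a match has to end in a word character.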

Here’s an example.

Input file contains:

This is text. This is written with intentionally repeated words. This is repeated, intentionally, to produce short output. This text — with intentionally repeated words — is written to produce short text output.
This output is text. Intentionally short text.

Output file will contain:

short 			3
this 			5
text 			5
is 				5
repeated 		3
intentionally 	4
to 				2
written 		2
produce 		2
words 			2
output 			3
with 			2

The columns are tab-separated, so you can paste the output into a spreadsheet, which should detect it as two columns. Then you can draw frequency plots or whatever.
If you want, you can write a bit more code to sort the output alphabetically or by frequency count. I didn’t bother.
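For anyone who does want sorted output, here is a minimal Python 3 sketch of both orderings. The dictionary counts is a hypothetical stand-in holding the numbers from the example output above; in the script itself you would build the pairs from the word list instead.

```python
# Hypothetical counts, copied from the example output above.
counts = {'short': 3, 'this': 5, 'text': 5, 'is': 5, 'repeated': 3,
          'intentionally': 4, 'to': 2, 'written': 2, 'produce': 2,
          'words': 2, 'output': 3, 'with': 2}

# Alphabetical order: sort the (word, count) pairs by word.
alphabetical = sorted(counts.items())

# Frequency order: highest count first, ties broken alphabetically.
by_frequency = sorted(counts.items(), key=lambda item: (-item[1], item[0]))

for word, count in by_frequency:
    print('%s\t%d' % (word, count))
```

Sorting on the tuple (-count, word) is just one way to get a stable, readable order; any other tie-breaking rule would do as well.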

A/B and Rh Antigens in Blood Types: A Statistical Test of Independence among IISER Kolkata Students

A couple of days ago one of my juniors in college (Indian Institute of Science Education and Research, Kolkata) found a spreadsheet lying unguarded in the guest account of our computing system.

This spreadsheet contained the blood types of the Masters students across five batches (‘07-‘13). With this information he made a nice little bar diagram of the frequency distribution of blood types, which I lift here:

[Figure: bar diagram of the frequency distribution of blood types]

When I looked at this graph, the first thing I noticed was that it shows the distribution across both the antigen type (A/B/AB/O) and the Rh factor (+/-). It struck me that by constructing a contingency table it might be possible to check whether these two properties are independent of each other.

For this I first read off from the graph the two-way contingency table of the joint distribution of antigen and Rh factor:

Rh \ Antigen       A     B     AB    O     Marginal totals
+                  76    116   28    122   342
-                  1     4     1     6     12
Marginal totals    77    120   29    128   354 (total)

Consider now the marginal distributions, i.e. the subtotals along the last row and column. If we divide these numbers by the total number of data points (354), we obtain the relative marginal frequencies. For a large sample, these may be identified with the probabilities of occurrence of each property irrespective of the other.
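As a rough illustration, the marginal totals and the relative marginal frequencies can be computed in a few lines of Python 3. The dictionary layout and the variable names are my own choice; only the eight counts come from the table.

```python
# Observed counts read off the contingency table: rows are the Rh factor,
# columns are the antigen groups A, B, AB and O.
observed = {
    '+': {'A': 76, 'B': 116, 'AB': 28, 'O': 122},
    '-': {'A': 1, 'B': 4, 'AB': 1, 'O': 6},
}

# Marginal totals along the rows (Rh factor) and columns (antigen group).
row_totals = {rh: sum(cells.values()) for rh, cells in observed.items()}
col_totals = {ag: sum(observed[rh][ag] for rh in observed)
              for ag in observed['+']}
total = sum(row_totals.values())

# Relative marginal frequencies, used as estimates of the probabilities.
p_rh = {rh: n / total for rh, n in row_totals.items()}
p_ag = {ag: n / total for ag, n in col_totals.items()}

print(row_totals, col_totals, total)
print(p_rh)
print(p_ag)
```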

Now comes the important part. If the two properties occur independently of each other, then the joint probability in any cell is the product of the marginal probabilities for its row and column (for independent events X and Y, P(X∩Y) = P(X)P(Y)). In terms of frequencies, this means that the number in any cell should be the product of the subtotals for its row and column, divided by the total (354). You can think of it this way: the frequency in any cell is one of its marginal frequencies multiplied by the other relative marginal frequency, because under independence the conditional probability of one property given the other is just its marginal probability. Example:

n(A+) = n(A) × P(+) = n(A) × n(+) / N = 77 × 342 / 354 ≈ 74.4.

Under this assumption of independence, it is possible to construct a contingency table inwards, starting from just the outer marginal frequencies. The table below was constructed this way from the marginal frequencies of the actual data, with each cell rounded to the nearest integer.

Rh \ Antigen       A     B     AB    O     Marginal totals
+                  74    116   28    124   342
-                  3     4     1     4     12
Marginal totals    77    120   29    128   354 (total)
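The table above can be reproduced with a short Python 3 sketch that works inwards from the marginal totals. The variable names are mine; the numbers are the marginal totals of the actual data.

```python
# Marginal totals of the actual data.
row_totals = {'+': 342, '-': 12}
col_totals = {'A': 77, 'B': 120, 'AB': 29, 'O': 128}
total = 354

# Under independence: n(cell) = n(row) * n(column) / N
expected = {rh: {ag: row_totals[rh] * col_totals[ag] / total
                 for ag in col_totals}
            for rh in row_totals}

# Round each cell to the nearest integer, as in the table.
rounded = {rh: {ag: round(value) for ag, value in cells.items()}
           for rh, cells in expected.items()}

for rh, cells in rounded.items():
    print(rh, cells)
```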

If you compare the joint frequencies in the two tables, you can see immediately that they are very close, lending support to the assumption that these factors are independent.

For a more graphical idea, I decided to plot cluster bars of the actual and the computed frequencies assuming independence.

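[Figure: clustered bar chart of the actual frequencies and the frequencies computed assuming independence]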

The average error between the actual and computed frequencies was 0 over the eight blood types. That doesn’t say much, because errors of opposite signs cancel each other. The root mean squared error was 1.16, which is also tiny in comparison with the frequencies themselves.
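Here is a minimal Python 3 sketch of how these two numbers can be computed. It assumes the errors are taken against the unrounded expected frequencies, which gives a root mean squared error close to the 1.16 quoted above; using the rounded table instead would give about 1.41. The variable names are mine.

```python
from math import sqrt

antigens = ['A', 'B', 'AB', 'O']
observed = {'+': [76, 116, 28, 122], '-': [1, 4, 1, 6]}

row_totals = {rh: sum(counts) for rh, counts in observed.items()}
col_totals = [sum(observed[rh][i] for rh in observed)
              for i in range(len(antigens))]
total = sum(row_totals.values())

# Error in each cell: actual frequency minus the (unrounded) frequency
# expected under independence.
errors = []
for rh, counts in observed.items():
    for i, actual in enumerate(counts):
        expected = row_totals[rh] * col_totals[i] / total
        errors.append(actual - expected)

mean_error = sum(errors) / len(errors)
rmse = sqrt(sum(e * e for e in errors) / len(errors))
print('mean error: %.2f, RMSE: %.2f' % (mean_error, rmse))
```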

Thus, it is pretty safe to suppose that the antigen type and the Rh factor are independent properties.

There are stronger, more explicit tests of association than what has been done here. The usual correlation coefficient, though, cannot be calculated here because the variables (antigen type and Rh factor) are categorical, not quantitative. However, there are others that you can read about here.
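One such test, which I did not run for this post, is Pearson’s chi-squared test of independence. The following is only a sketch in Python 3 with variable names of my own, and it comes with a caveat: the expected counts in the Rh- row are below 5, so the chi-squared approximation is rough for this table. The critical value 7.815 is the standard tabulated value for 3 degrees of freedom at the 5% level.

```python
observed = [[76, 116, 28, 122],   # Rh+
            [1, 4, 1, 6]]         # Rh-

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Pearson's statistic: sum over all cells of (actual - expected)^2 / expected,
# where expected is the frequency under independence.
chi_squared = 0.0
for i, row in enumerate(observed):
    for j, actual in enumerate(row):
        expected = row_totals[i] * col_totals[j] / total
        chi_squared += (actual - expected) ** 2 / expected

degrees_of_freedom = (len(row_totals) - 1) * (len(col_totals) - 1)

# Tabulated critical value of chi-squared for 3 degrees of freedom at 5%.
CRITICAL_5_PERCENT_DF3 = 7.815
independence_not_rejected = chi_squared < CRITICAL_5_PERCENT_DF3

print('chi-squared = %.2f with %d degrees of freedom'
      % (chi_squared, degrees_of_freedom))
print('independence not rejected at 5%:', independence_not_rejected)
```

A statistic well below the critical value means the data give no reason to reject independence, which agrees with the informal comparison of the two tables.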

If you want to test this method or any of the other methods on other datasets of blood types, this page provides quite a bit of data for various countries (although last I checked, the figures did not seem to be entirely accurate).