Python Script to Generate Frequency Counts of Words in a Text

The following python script takes a text file as input and produces an unsorted list of frequency counts of words in the text as an output text file. It’s pretty simple and short, and uses only the regular expressions module re of python, which is a standard library, so this script will run in any system with a standard python installation.

from re import compile
l=compile("([\w,.'\x92]*\w)").findall(open(raw_input('Input file: '),'r').read().lower())
f=open(raw_input('Output file: '),'w')
for word in set(l):
	print>>f, word, '\t', l.count(word)
f.close()

Note that ‘words’ here doesn’t mean dictionary words (with such a small script it’s not possible to check against dictionary words). Instead, ‘words’ are what you get when you split the text at regular expression word boundaries. So if you have a word like “365b1”, that’ll also be listed in the output.

Here’s an example.

Input file contains:

This is text. This is written with intentionally repeated words. This is repeated, intentionally, to produce short output. This text — with intentionally repeated words — is written to produce short text output.
This output is text. Intentionally short text.

Output file will contain:

short 			3
this 			5
text 			5
is 				5
repeated 		3
intentionally 	4
to 				2
written 		2
produce 		2
words 			2
output 			3
with 			2

The columns are tab-separated, so you can copy this into spreadsheets which should detect this as two columns and paste accordingly. Then you can draw frequency plots or whatever.
If you want, you can write a bit more code to sort the output alphabetically or by frequency count. I didn’t bother.

Advertisements

3 thoughts on “Python Script to Generate Frequency Counts of Words in a Text

  1. Pingback: Why does the cmd window close when the Python script ends? - 4bytes.co.uk

    • Well, first, instead of printing out the results, you can store them in a dictionary, where the keys are the words and the values are the frequencies. Then you have to sort the dictionary by value. You can search in Stack Overflow for details of these.

Leave a Reply

Please log in using one of these methods to post your comment:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s