As recommended above, the Counter class from the collections module is definitely the way to go for counting applications.
This solution also addresses the request to count words in multiple files: it uses the fileinput.input() function to iterate over the contents of all the filenames specified on the command line (or, if no filenames are given, it reads from STDIN, typically the keyboard).
Finally, it uses a slightly more sophisticated approach for breaking each line into 'words', with a regular expression as the delimiter. As noted in the code, it handles contractions more gracefully (however, it will be confused by apostrophes used as single quotes).
"""countwords.py
count all words across all files
"""
import fileinput
import re
import collections
# create a regex delimiter that matches one or more characters that
# are neither word characters nor apostrophes; this allows contractions
# to be treated as a single word (eg can't won't didn't)
# Caution: this WILL get confused by a line that uses apostrophes
# as single quotes: eg 'hello' would be treated as a 7 letter word
word_delimiter = re.compile(r"[^\w']+")
# create an empty Counter
counter = collections.Counter()
# use fileinput.input() to open and read ALL lines from ALL files
# specified on the command line, or if no files specified on the
# command line then read from STDIN (ie the keyboard or redirect)
for line in fileinput.input():
    for word in word_delimiter.split(line):
        counter[word.lower()] += 1  # count case insensitively
del counter['']  # handle corner case of the occasional 'empty' word
# compute the total number of words using .values() to get a
# view of all the Counter values (ie the individual word counts)
# then pass that view to the sum function, which is able to
# work with any iterable
total = sum(counter.values())
# iterate through the key/value pairs (ie word/word_count) in sorted
# order - the lambda function says sort based on position 1 of each
# word/word_count tuple (ie the word_count) and reverse=True does
# exactly what it says = reverse the normal order so it now goes
# from highest word_count to lowest word_count
print("{:>10s} {:>8s} {:s}".format("occurs", "percent", "word"))
for word, count in sorted(counter.items(),
                          key=lambda t: t[1],
                          reverse=True):
    print("{:10d} {:8.2f}% {:s}".format(count, count/total*100, word))
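To see how the delimiter behaves on its own, here is a minimal sketch that applies the same regular expression to a couple of sample lines. The helpers split_words and split_words_stripped are hypothetical names that are not part of countwords.py, and stripping leading and trailing apostrophes is just one possible way to soften the single-quote problem, not something the script above does.

```python
import re

# Same delimiter as in countwords.py: one or more characters that are
# neither word characters nor apostrophes
word_delimiter = re.compile(r"[^\w']+")


def split_words(line):
    """Hypothetical helper: lowercase a line and split it into words."""
    # split() can produce empty strings at the edges of a line, so drop them
    return [w for w in word_delimiter.split(line.lower()) if w]


def split_words_stripped(line):
    """Hypothetical variant: also strip apostrophes used as quotes."""
    # strip("'") only touches the ends, so can't keeps its apostrophe
    words = (w.strip("'") for w in split_words(line))
    return [w for w in words if w]


print(split_words("He can't fly, but I won't mind.\n"))
# ['he', "can't", 'fly', 'but', 'i', "won't", 'mind']

print(split_words("She said 'hello' twice"))
# ['she', 'said', "'hello'", 'twice']

print(split_words_stripped("She said 'hello' twice"))
# ['she', 'said', 'hello', 'twice']
```

The second call shows the caveat from the comments: the quoted 'hello' is kept with its quotes and would be counted separately from a plain hello. The stripped variant also turns a trailing possessive such as dogs' into dogs, which may or may not be what you want.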
Example output:
$ python3 countwords.py
I have a dog, he is a good dog, but he can't fly
^D
    occurs  percent word
         2    15.38% a
         2    15.38% dog
         2    15.38% he
         1     7.69% i
         1     7.69% have
         1     7.69% is
         1     7.69% good
         1     7.69% but
         1     7.69% can't
         1     7.69% fly
And:
$ python3 countwords.py text1 text2
    occurs  percent word
         2    11.11% hello
         2    11.11% i
         1     5.56% there
         1     5.56% how
         1     5.56% are
         1     5.56% you
         1     5.56% am
         1     5.56% fine
         1     5.56% mark
         1     5.56% where
         1     5.56% is
         1     5.56% the
         1     5.56% dog
         1     5.56% haven't
         1     5.56% seen
         1     5.56% him
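If you would rather call the counting logic from other code than from the command line, the same idea can be sketched as a function. This is only an illustration under a few assumptions: count_words is a hypothetical helper, the two temporary files and their contents are made up for the demonstration, and Counter.most_common() is swapped in for the sorted() call with a lambda (it returns the same word/count pairs, ordered from highest to lowest count).

```python
import collections
import fileinput
import os
import re
import tempfile

word_delimiter = re.compile(r"[^\w']+")


def count_words(filenames):
    """Hypothetical helper: count words across all the named files."""
    counter = collections.Counter()
    # files=... makes fileinput read these paths instead of sys.argv[1:]
    with fileinput.input(files=filenames) as lines:
        for line in lines:
            # skip the empty strings that split() can produce
            counter.update(w for w in word_delimiter.split(line.lower()) if w)
    return counter


# Two throwaway files with made-up contents, just for the demonstration
with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for name, text in [("a.txt", "Hello there, hello again\n"),
                       ("b.txt", "I haven't seen the dog\n")]:
        path = os.path.join(tmp, name)
        with open(path, "w") as f:
            f.write(text)
        paths.append(path)
    counter = count_words(paths)

total = sum(counter.values())

# most_common() returns (word, count) pairs from highest to lowest count
for word, count in counter.most_common():
    print("{:10d} {:8.2f}% {:s}".format(count, count / total * 100, word))
```

Filtering out empty strings while updating the Counter means there is no '' entry to delete afterwards, and most_common(n) can be used to print only the top n words.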