2

I'm new to Python and I want to count the number of times each word occurs across all the files, then display each word, the number of times it occurred, and the percentage of the time it occurred. The list should be sorted so the most frequent word appears first and the least frequent word appears last. I'm working on a small sample right now, just one file, but I can't get it to work right.

from collections import defaultdict

words = "apple banana apple strawberry banana lemon"

d = defaultdict(int)
for word in words.split():
    d[word] += 1
nubby
  • 1
    Take a look at [collections.Counter](https://docs.python.org/3.7/library/collections.html#collections.Counter) – Dani Mesejo Dec 04 '19 at 00:13
  • 3
    What isn't working? "I can't get to work right" isn't helpful, diagnostically speaking. Could you provide example input, expected output, and actual output (including traceback, if an exception is thrown)? These are the basics of a [MCVE]. As @RaySteam says, `collections.Counter` is how you do this in real code, but for a learning exercise/homework, you may want to implement it yourself. – ShadowRanger Dec 04 '19 at 00:14

4 Answers

2

As recommended above, the Counter class from the collections module is definitely the way to go for counting applications.

This solution also addresses the request to count words in multiple files: it uses the fileinput.input() function to iterate over the contents of all the filenames specified on the command line (or, if no filenames are specified, it reads from STDIN, typically the keyboard).

Finally, it uses a slightly more sophisticated approach for breaking each line into 'words', with a regular expression as the delimiter. As noted in the code, it handles contractions more gracefully (however, it will be confused by apostrophes being used as single quotes).

"""countwords.py
   count all words across all files
"""

import fileinput
import re
import collections

# create a regex delimiter that matches one or more characters that
# are neither word characters nor apostrophes; this allows contractions
# to be treated as a single word (e.g. can't, won't, didn't)
# Caution: this WILL get confused by a line that uses an apostrophe
# as a single quote: e.g. 'hello' would be treated as a 7-character word

word_delimiter = re.compile(r"[^\w']+")

# create an empty Counter

counter = collections.Counter()

# use fileinput.input() to open and read ALL lines from ALL files
# specified on the command line, or if no files specified on the
# command line then read from STDIN (ie the keyboard or redirect)

for line in fileinput.input():
    for word in word_delimiter.split(line):
        counter[word.lower()] += 1   # count case insensitively

del counter['']   # handle corner case of the occasional 'empty' word

# compute the total number of words: .values() returns a view of all
# the Counter values (i.e. the individual word counts), and sum()
# accepts any iterable, so the view can be passed to it directly

total = sum(counter.values())

# iterate through the key/value pairs (ie word/word_count) in sorted
# order - the lambda function says sort based on position 1 of each
# word/word_count tuple (ie the word_count) and reverse=True does
# exactly what it says = reverse the normal order so it now goes
# from highest word_count to lowest word_count

print("{:>10s}  {:>8s} {:s}".format("occurs", "percent", "word"))

for word, count in sorted(counter.items(),
                          key=lambda t: t[1],
                          reverse=True):
    print("{:10d} {:8.2f}% {:s}".format(count, count/total*100, word))
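
To see what the delimiter does in isolation, here is a small sketch. The pattern is the same one used in the script; split_words is a hypothetical helper name introduced only for this illustration, and it drops the empty strings that split() can produce at the ends of a line.

```python
import re

# Same pattern as in the script: split on runs of characters that are
# neither word characters nor apostrophes.
WORD_DELIMITER = re.compile(r"[^\w']+")


def split_words(line):
    """Hypothetical helper: lowercased words in `line`, empty strings dropped."""
    return [w.lower() for w in WORD_DELIMITER.split(line) if w]


print(split_words("I can't fly, but he can!"))
# contractions survive: ['i', "can't", 'fly', 'but', 'he', 'can']

print(split_words("She said 'hello' twice"))
# the quotes stick to the word: ['she', 'said', "'hello'", 'twice']
```

Filtering out the empty strings while splitting is an alternative to the del counter[''] step in the script; either approach should give the same counts.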

Example output:

$ python3 countwords.py
I have a dog, he is a good dog, but he can't fly
^D

    occurs   percent word
         2    15.38% a
         2    15.38% dog
         2    15.38% he
         1     7.69% i
         1     7.69% have
         1     7.69% is
         1     7.69% good
         1     7.69% but
         1     7.69% can't
         1     7.69% fly

And:

$ python3 countwords.py text1 text2
    occurs   percent word
         2    11.11% hello
         2    11.11% i
         1     5.56% there
         1     5.56% how
         1     5.56% are
         1     5.56% you
         1     5.56% am
         1     5.56% fine
         1     5.56% mark
         1     5.56% where
         1     5.56% is
         1     5.56% the
         1     5.56% dog
         1     5.56% haven't
         1     5.56% seen
         1     5.56% him
quizdog
1

Using your code, here's a neater approach:

import sys

# Initializing Dictionary
d = {}
with open(sys.argv[1], 'r') as f:

    # counting number of times each word comes up in list of words (in dictionary)
    for line in f: 
        words = line.lower().split() 
        # Iterate over each word in line 
        for word in words: 
            if word not in d.keys():
                d[word] = 1
            else:
                d[word]+=1

n_all_words = sum(d.values())

# Sort the dictionary by count, most frequent first, using this useful solution
# https://stackoverflow.com/a/613218/10521959
import operator
sorted_d = sorted(d.items(), key=operator.itemgetter(1), reverse=True)

# Print percentage occurrence, in sorted order
for k, v in sorted_d:
    print(f'{k} occurs {v} times and is {(100*v/n_all_words):,.2f}% of all words.')
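
One thing to watch with line.lower().split() is that it splits on whitespace only, so 'dog,' and 'dog.' end up as different keys. A minimal sketch of one way around that, assuming it is acceptable to trim punctuation from the ends of each token (normalize is a hypothetical helper name, not something from the code above):

```python
import string


def normalize(token):
    """Hypothetical helper: lowercase a token and trim punctuation from its ends."""
    return token.lower().strip(string.punctuation)


line = "I have a dog, he is a good dog."
d = {}
for token in line.split():
    word = normalize(token)
    if word:  # skip tokens that were nothing but punctuation
        d[word] = d.get(word, 0) + 1

print(d["dog"])  # 2; without normalize(), 'dog,' and 'dog.' are counted separately
```

Because strip() only touches the ends of the token, the apostrophe inside a contraction such as can't is left alone.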
Yaakov Bressler
1

As mentioned in the comments, this is precisely what collections.Counter is for:

from collections import Counter

words = 'a b c a'.split()
print(Counter(words).most_common())

From docs: https://docs.python.org/2/library/collections.html

most_common([n])
Return a list of the n most common elements and their counts
from the most common to the least. If n is omitted or None,
most_common() returns all elements in the counter.
Elements with equal counts are ordered arbitrarily:

>>> Counter('abracadabra').most_common(3)
[('a', 5), ('r', 2), ('b', 2)]
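
To get from most_common() to the report the question asks for, one possible sketch adds the percentage column. word_report is a hypothetical helper name, not part of collections, and the sample string is the one from the question.

```python
from collections import Counter


def word_report(text):
    """Hypothetical helper: (word, count, percent) rows, most frequent first."""
    counts = Counter(text.split())
    total = sum(counts.values())
    # most_common() with no argument returns every (word, count) pair,
    # ordered from the highest count to the lowest
    return [(word, count, 100 * count / total)
            for word, count in counts.most_common()]


for word, count, percent in word_report("apple banana apple strawberry banana lemon"):
    print(f"{word:<12s} {count:3d} {percent:6.2f}%")
```

An empty string produces an empty list, so the division by the total never runs on zero words.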
Cireo
0

The most straightforward way to do this is to use the Counter class:

from collections import Counter
words = "apple banana apple strawberry banana lemon"  # sample string from the question
c = Counter(words.split())

output:

Counter({'apple': 2, 'banana': 2, 'strawberry': 1, 'lemon': 1})

to get just the words (in first-seen order, not sorted by frequency), or just the counts:

list(c.keys())
list(c.values())

or put it into a normal dict:

dict(c.items())

or list of tuples:

c.most_common()
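
Since the question is about counting across several files, one way to extend this is to feed every file into the same Counter with update(). This is only a sketch: count_words_in_files is a hypothetical helper name, and it assumes UTF-8 text files and plain whitespace splitting.

```python
from collections import Counter


def count_words_in_files(paths):
    """Hypothetical helper: one Counter covering every file in `paths`."""
    total = Counter()
    for path in paths:
        with open(path, encoding="utf-8") as f:
            for line in f:
                # update() adds to the existing counts instead of replacing them
                total.update(line.lower().split())
    return total
```

Calling count_words_in_files(["text1", "text2"]).most_common() would then give the combined list of tuples, most frequent word first, where the two file names are placeholders for your own files.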
Derek Eden