0

I am trying to find the most frequent words in a text file and print them in alphabetical order in this program.

For example, "that" is the most frequent word in the text file, so it should be printed first: "that #"

It needs to follow the same format as the program below and produce the output shown after it:

d = dict()

def counter_one():
    d = dict()
    word_file = open('gg.txt')
    for line in word_file:
        word = line.strip().lower()
        d = counter_two(word, d)
    return d

def counter_two(word, d):
    d = dict()
    word_file = open('gg.txt')
    for line in word_file:
        if word not in d:
            d[word] = 1
        else:
            d[word] + 1
    return d

def diction(d):
    for key, val in d.iteritems():
        print key, val

counter_one()
diction(d)

It should run something like this in the shell:

>>>
Words in text: ###
Frequent Words: ###
that 11
the 11
we 10
which 10
>>>
Joe21

3 Answers

3

One easy way to get frequency counts is to use the Counter class in the standard-library collections module. You pass it a list of words and it counts them, mapping each word to its frequency.

from collections import Counter
frequencies = Counter()
with open('gg.txt') as f:
  for line in f:
    frequencies.update(line.lower().split())

I used the lower() method to avoid counting "the" and "The" separately.

Then you can output them in frequency order with frequencies.most_common() or frequencies.most_common(n) if you only want the top n.
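As a small illustration of most_common(), here is a sketch in Python 3 syntax that runs on an in-memory list instead of gg.txt. The sample sentence and the variable names are made up for the example.

```python
from collections import Counter

# Hypothetical sample text standing in for the contents of gg.txt.
sample_words = "the cat and the hat and the bat".split()

frequencies = Counter(sample_words)

# most_common() with no argument returns every (word, count) pair,
# highest count first; most_common(n) keeps only the first n pairs.
all_pairs = frequencies.most_common()
top_two = frequencies.most_common(2)

print(top_two)  # [('the', 3), ('and', 2)]
```

Each entry is a (word, count) tuple, which is why the sorting step below works on pairs.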

If you want to sort the resulting list by frequency, highest first, and then alphabetically for words with the same frequency, you can use the sorted builtin function with a key argument of lambda (x,y): (-y,x). Negating the count puts the largest counts first while the words still sort in ascending order. So, your final code to do this would be:

from collections import Counter
frequencies = Counter()
with open('gg.txt') as f:
  for line in f:
    frequencies.update(line.lower().split())
# Sort by descending count, then alphabetically by word
most_frequent = sorted(frequencies.most_common(4), key=lambda (x,y): (-y,x))
for (word, count) in most_frequent:
  print word, count

Then the output will be

that 11
the 11
we 10
which 10
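For Python 3, where a lambda can no longer unpack a tuple argument, the same idea might look like the sketch below. The function name top_words and the sample list are hypothetical. It sorts every (word, count) pair by descending count and then alphabetically before taking the first n, so the choice among tied words does not depend on the order in which most_common(n) happens to return them.

```python
from collections import Counter

def top_words(words, n):
    """Return the n most frequent words, ties broken alphabetically."""
    counts = Counter(words)
    # Negating the count sorts high counts first while the word itself
    # still sorts in ascending (alphabetical) order.
    ranked = sorted(counts.items(), key=lambda pair: (-pair[1], pair[0]))
    return ranked[:n]

# Hypothetical sample; the counts are chosen so that two ties occur.
sample = ["we"] * 2 + ["the"] * 3 + ["which"] * 2 + ["that"] * 3 + ["a"]
for word, count in top_words(sample, 4):
    print(word, count)
```

With this sample the loop prints "that 3", "the 3", "we 2", "which 2", one pair per line.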
murgatroid99
  • If you're using Python 3 the lambda should be changed to `key=lambda x: (x[1], x[0])` (see [this](https://stackoverflow.com/a/15712231/2393963)) – Felipe Cortez Jun 19 '18 at 12:40
2

You can do this more simply using Counter from the collections module. First count the words, then sort by the number of appearances of each word and by the word itself:

from collections import Counter

# Load the file and extract the words
lines = open("gettysburg_address.txt").readlines()
words = [ w for l in lines for w in l.rstrip().split() ]
print 'Words in text:', len(words)

# Use counter to get the counts
counts = Counter( words )

# Sort the (word, count) tuples by the count, then the word itself,
# and output the k most frequent
k = 4
print 'Frequent words:'
for w, c in sorted(counts.most_common(k), key=lambda (w, c): (c, w), reverse=True):
    print '%s %s' % (w, c)

Output:

Words in text: 278
Frequent words:
that 13
the 9
we 8
to 8
mdml
  • You can try running it as is yourself! I was trying to give you an alternative solution for solving this problem that is more concise. – mdml Nov 19 '13 at 21:39
  • I know and I appreciate that, but it has to be formatted a certain way like my code is. – Joe21 Nov 19 '13 at 21:40
  • Can you modify your question to specify that format? Or am I missing something? – mdml Nov 19 '13 at 21:41
  • And it shouldn't be printing out the way you have it. It should be printed vertically like I have it in my question. – Joe21 Nov 19 '13 at 21:43
  • That is what I want it to happen, but I do need it to print out all the words, though. Is there any way you could add something into my code to make it print that way? I don't use the from and import functions. That's a bit too advanced. – Joe21 Nov 19 '13 at 21:50
1

Why do you keep re-opening the file and creating new dictionaries? What does your code need to do?

create a new empty dictionary to store words {word: count}
open the file
work through each line (word) in the file
    if the word is already in the dictionary
        increment count by one
    if not
        add to dictionary with count 1
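
The steps above could be sketched in Python 3 roughly as follows, using only a plain dict and no imports. The function name count_words and the sample lines are assumptions for illustration, and skipping blank lines is an addition the outline does not mention.

```python
def count_words(lines):
    """Count how often each word occurs in an iterable of lines."""
    counts = {}  # maps word -> number of occurrences
    for line in lines:
        word = line.strip().lower()
        if not word:
            continue  # skip blank lines (an assumption, not in the outline)
        if word in counts:
            counts[word] += 1
        else:
            counts[word] = 1
    return counts

# Hypothetical input with one word per line, as the question's loop assumes.
sample_lines = ["that\n", "The\n", "that\n", "\n", "the\n", "we\n"]
word_counts = count_words(sample_lines)
print(word_counts)  # {'that': 2, 'the': 2, 'we': 1}
```

An open file is also an iterable of lines, so the file needs to be opened only once and passed to the function, for example inside a with open('gg.txt') as f: block.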

Then you can easily get the number of distinct words

len(dictionary)

and the n most common words with their counts

sorted(dictionary.items(), key=lambda x: x[1], reverse=True)[:n]
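
As a quick check of that expression, here is a sketch in Python 3 syntax with a made-up dictionary of counts. Because sorted is stable even with reverse=True, words with equal counts keep the order they had in the dictionary.

```python
# Hypothetical word counts, standing in for the result of the counting steps.
dictionary = {"we": 10, "that": 11, "which": 10, "the": 11, "a": 7}
n = 4

# Number of distinct words seen.
print("Distinct words:", len(dictionary))

# The n most common (word, count) pairs, highest count first.
top = sorted(dictionary.items(), key=lambda x: x[1], reverse=True)[:n]
for word, count in top:
    print(word, count)
```

If ties should instead be ordered alphabetically, a key such as lambda x: (-x[1], x[0]) without reverse=True is one option.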
jonrsharpe