wordcount = {}
    for vocab in file.read().split():
        if vocab not in wordcount:
            wordcount[vocab] = 1
        else:
            wordcount[vocab] = wordcount[vocab] + 1
    for (word,number) in wordcount.items():
        print (word, number)
print (word_count(0))
Morgan Thrapp
Oscar
  • What's the question? – Chris Mueller Nov 04 '16 at 14:24
  • How can I remove the punctuation from the dictionary when I print it? When I print it returns a lot of punctuation at the end of words – Oscar Nov 04 '16 at 14:25
  • Maybe you should remove punctuation from the text before putting words in the dict. – polku Nov 04 '16 at 14:29
  • As polku said, you should be [stripping](https://docs.python.org/3/library/stdtypes.html#str.rstrip) the punctuation from words before adding them to the dict. Also consider using a [Counter](https://docs.python.org/3/library/collections.html#collections.Counter) instead of a plain dict. – PM 2Ring Nov 04 '16 at 14:36

1 Answer


As PM 2Ring notes, a Counter object would be useful here, or simply a defaultdict, both from the collections module. To keep punctuation out of the words in the first place, we can use the regular expression module re, either for a more powerful re.split() or simply re.findall():

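from collections import defaultdict
from operator import itemgetter
from re import findall, IGNORECASE

wordcount = defaultdict(int)

# Read the whole file; the with block closes it when done.
with open("license.txt") as file:
    text = file.read()

# Match runs of letters only, so punctuation never reaches the dictionary.
for vocab in findall(r"[A-Z]+", text, flags=IGNORECASE):
    wordcount[vocab.lower()] += 1

# Print the words from most to least frequent.
for word, number in sorted(wordcount.items(), key=itemgetter(1), reverse=True):
    print(word, number)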

OUTPUT

> python3 test.py
the 77
or 54
of 48
to 47
software 44
and 36
any 36
for 23
license 22
you 20
this 19
agreement 18
be 17
by 16
in 16
other 14
may 13
use 11
not 10
that 10
...
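
For comparison, here is a minimal sketch of the approach suggested in the comments: strip punctuation from each word with str.strip() and count with a Counter. The function name count_words and the sample sentence are made up for illustration; they are not from the question.

```python
import string
from collections import Counter


def count_words(text):
    """Count whitespace-separated words, ignoring case and edge punctuation."""
    # Strip punctuation from both ends of each token. Tokens that were
    # nothing but punctuation become empty strings and are skipped below.
    words = (token.strip(string.punctuation).lower() for token in text.split())
    return Counter(word for word in words if word)


# Hypothetical sample text, standing in for file.read().
counts = count_words("The license, the LICENSE; and the 'software'.")

# most_common() returns (word, count) pairs sorted by descending count.
for word, number in counts.most_common(3):
    print(word, number)
```

Unlike the regular expression above, this keeps punctuation inside a word (such as the apostrophe in "don't") and only trims it from the ends.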

There are always trade-offs: you might want to fine-tune the pattern to allow hyphenated words or apostrophes, depending on your application.
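
As one possible tuning, the sketch below allows a single apostrophe or hyphen between runs of letters. The pattern, the name find_words, and the sample sentence are assumptions for illustration only; the right pattern depends on your text.

```python
import re

# Letters, optionally followed by groups of one apostrophe or hyphen plus
# more letters. The group is non-capturing so findall() returns whole matches.
WORD_PATTERN = re.compile(r"[a-z]+(?:['-][a-z]+)*", re.IGNORECASE)


def find_words(text):
    """Return the lowercased words in text, keeping inner ' and - characters."""
    return [word.lower() for word in WORD_PATTERN.findall(text)]


print(find_words("Don't use a well-known trade-off -- it's risky."))
```

A leading or trailing quote or dash is still dropped, because the pattern requires letters on both sides of each apostrophe or hyphen.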

Reading the entire file into memory and processing it is fine if the input file is relatively small. If it is not, read it line by line instead, either by iterating over the file object or by calling readline() in a loop, and process each line in turn.
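
A minimal sketch of the line-by-line variant, assuming the same letters-only pattern as above. The function name count_words_by_line is hypothetical, and io.StringIO stands in for a real file so the example runs on its own; with a real file you would pass the object returned by open("license.txt").

```python
import io
import re
from collections import Counter


def count_words_by_line(lines):
    """Count words from any iterable of lines, such as an open file."""
    counts = Counter()
    for line in lines:
        # Only one line is held in memory at a time.
        words = re.findall(r"[A-Z]+", line, flags=re.IGNORECASE)
        counts.update(word.lower() for word in words)
    return counts


# io.StringIO behaves like an open text file for iteration purposes.
sample = io.StringIO("The end.\nthe END!\n")
for word, number in count_words_by_line(sample).most_common():
    print(word, number)
```

Because the pattern matches letters only, a match can never span a newline, so the counts are the same as when the whole file is read at once.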

cdlane