
I'm trying to create a dictionary of words from a text file, count the instances of each word, and then be able to search for a word in the dictionary and get its count, but I'm at a standstill. I'm having the most trouble making the words lowercase and removing their punctuation, because otherwise my count will be off. Any suggestions?

f = open("C:\Users\Mark\Desktop\jefferson.txt", "r")
wc = {}
words = f.read().split()
count = 0
i = 0
for line in f:
    count += len(line.split())
for w in words:
    if i < count:
        words[i].translate(None, string.punctuation).lower()
        i += 1
    else:
        i += 1
        print words
for w in words:
    if w not in wc:
        wc[w] = 1
    else:
        wc[w] += 1
print wc['states']
DOOM
Murph

3 Answers


A few points:

In Python, always use the following construct for reading files:

 with open('ls;df', 'r') as f:
     # rest of the statements

If you use f.read().split(), it reads all the way to the end of the file, so your later for line in f loop sees nothing. To iterate over the file again you first need to go back to the beginning:

f.seek(0)
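A minimal sketch of that pitfall (Python 3 syntax; the file name and contents here are made up for illustration):

```python
import tempfile, os

# Write a small throwaway file (hypothetical content).
path = os.path.join(tempfile.gettempdir(), "seek_demo.txt")
with open(path, "w") as f:
    f.write("hello world")

with open(path, "r") as f:
    first = f.read()    # whole file: "hello world"
    second = f.read()   # "" -- the file pointer now sits at EOF
    f.seek(0)           # rewind to the beginning
    third = f.read()    # "hello world" again

os.remove(path)
```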

Third, the part where you do:

for w in words: 
    if i < count: 
        words[i].translate(None, string.punctuation).lower() 
        i += 1 
    else: 
        i += 1 
        print words

You don't need to keep a counter in Python. You can simply do ...

for i, w in enumerate(words): 
    if i < count: 
        words[i].translate(None, string.punctuation).lower() 
    else: 
        print words

However, you don't even need to check for i < count here... You can simply do:

words = [w.translate(None, string.punctuation).lower() for w in words]
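Note that str.translate(None, string.punctuation) only works on Python 2 byte strings. If you are on Python 3 (an assumption, since the question's code is Python 2), the equivalent goes through str.maketrans:

```python
import string

words = ["Hello,", "World!", "it's", "Fine."]  # sample data, not the question's file
table = str.maketrans("", "", string.punctuation)  # map every punctuation char to "deleted"
cleaned = [w.translate(table).lower() for w in words]
# cleaned == ["hello", "world", "its", "fine"]
```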

Finally, if you just want to count 'states' and not create an entire dictionary of counts, consider using filter:

print len(filter( lambda m: m == 'states', words ))
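On Python 3, filter returns a lazy iterator, so calling len() on it raises a TypeError. A few equivalent ways to get the same count (the word list below is made up):

```python
words = ["the", "states", "of", "the", "states"]  # made-up word list

count_a = len(list(filter(lambda m: m == "states", words)))  # Python 3: materialize the iterator first
count_b = words.count("states")                              # list.count does the same job directly
count_c = sum(1 for w in words if w == "states")             # generator form, no temporary list
```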

One last thing ...

If the file is large, it is inadvisable to put every word in memory at once. Consider updating the wc dictionary line by line. Instead of doing what you did, you can consider:

for line in f: 
    words = line.split()
    # rest of your code
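Putting the line-by-line idea together, a minimal runnable sketch (Python 3 syntax; io.StringIO and the sample text stand in for the asker's jefferson.txt):

```python
import io
import string

# Made-up stand-in for the asker's jefferson.txt.
sample = io.StringIO("We hold these truths.\nThese truths are held.\n")

wc = {}
table = str.maketrans("", "", string.punctuation)
for line in sample:                      # one line in memory at a time
    for word in line.split():
        word = word.translate(table).lower()
        wc[word] = wc.get(word, 0) + 1   # dict.get avoids the if/else branch
```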
Boris Gorelik
ssm

This sounds like a job for collections.Counter:

import collections

with open('gettysburg.txt') as f:
    c = collections.Counter(f.read().split())

print "'Four' appears %d times"%c['Four']
print "'the' appears %d times"%c['the']
print "There are %d total words"%sum(c.values())
print "The 5 most common words are", c.most_common(5)

Result:

$ python foo.py 
'Four' appears 1 times
'the' appears 9 times
There are 267 total words
The 5 most common words are [('that', 10), ('the', 9), ('to', 8), ('we', 8), ('a', 7)]

Of course, this counts "liberty," and "this." as words (note the punctuation attached to each word). It also counts "The" and "the" as distinct words. Finally, reading the file as a whole can be expensive on very large files.

Here is a version that ignores punctuation and case, and is more memory-efficient on large files.

import collections
import re

with open('gettysburg.txt') as f:
    c = collections.Counter(
        word.lower()
        for line in f
        for word in re.findall(r'\b[^\W\d_]+\b', line))

print "'Four' appears %d times"%c['Four']
print "'the' appears %d times"%c['the']
print "There are %d total words"%sum(c.values())
print "The 5 most common words are", c.most_common(5)

Result:

$ python foo.py 
'Four' appears 0 times
'the' appears 11 times
There are 271 total words
The 5 most common words are [('that', 13), ('the', 11), ('we', 10), ('to', 8), ('here', 8)]
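The pattern r'\b[^\W\d_]+\b' matches maximal runs of letters: [^\W\d_] means "word character, minus digits and underscores". A quick check on a made-up string:

```python
import re

# [^\W\d_] = word characters excluding digits and underscores, i.e. letters only,
# so punctuation and numbers are dropped and hyphenated words split in two.
tokens = re.findall(r'\b[^\W\d_]+\b', 'liberty, and this. four-score 1863')
# tokens == ['liberty', 'and', 'this', 'four', 'score']
```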

Robᵩ
File_Name = 'file.txt'
counterDict={}

with open(File_Name,'r') as fh:
    for line in fh:
        # removing their punctuation
        words = line.replace('.','').replace('\'','').replace(',','').lower().split()
        for word in words:
            if word not in counterDict:
                counterDict[word] = 1
            else:
                counterDict[word] = counterDict[word] + 1

print('Count of the word > common< :: ',  counterDict.get('common',0))
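The chained replace calls only strip periods, apostrophes, and commas. A hedged variation (my suggestion, not the answer's code) that drops all of string.punctuation via str.translate, shown on made-up sample lines instead of file.txt:

```python
import string

counterDict = {}
table = str.maketrans("", "", string.punctuation)  # delete every ASCII punctuation char

# Made-up stand-in for the lines of file.txt.
for line in ["Common sense, isn't common.", "Common ground."]:
    for word in line.translate(table).lower().split():
        counterDict[word] = counterDict.get(word, 0) + 1
# counterDict.get('common', 0) == 3
```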
Fuji Komalan