from glob import glob
pattern = "D:\\report\\shakeall\\*.txt"
filelist = glob(pattern)
def countwords(fp):
    with open(fp) as fh:
        return len(fh.read().split())
print "There are", sum(map(countwords, filelist)), "words in the files. " "From directory", pattern
import os
import re
import string
uniquewords = set()
for root, dirs, files in os.walk("D:\\report\\shakeall"):
    for name in files:
        # Open each file in a with-block so the handle is closed promptly
        with open(os.path.join(root, name)) as fh:
            uniquewords.update(fh.read().split())
wordlist = list(uniquewords)

This code counts the total number of words and the number of unique words. The problem is that len(uniquewords) returns an unreasonably large number, because it treats, for example, 'shake', 'shake!', 'shake,' and 'shake?' as different words. I've tried to remove the punctuation from uniquewords by converting it to a list and modifying that, but every attempt failed. Can anybody help me?

unutbu
rocksland
  • [How to format code on SO](http://meta.stackexchange.com/questions/22186/how-do-i-format-my-code-blocks) – Levon Aug 11 '12 at 11:25
  • Is there a reason why `"words in the files. " "From directory"` isn't simply `"words in the files. From directory"`? – Levon Aug 11 '12 at 11:29
  • http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python – NIlesh Sharma Aug 11 '12 at 11:30

1 Answer

  1. Use a regex with the `\w+` pattern to match words and exclude punctuation.
  2. When counting in Python, use `collections.Counter`.

The example data for this code is appended at the end:

import re
from collections import Counter

pattern = re.compile(r'\w+')

with open('data') as f:
    text = f.read()

# Parenthesized form runs on both Python 2 and Python 3
print(Counter(pattern.findall(text)))

gives:

Counter(
{'in': 4, 'the': 4, 'string': 3, 'matches': 3, 'are': 2,
'pattern': 2, '2': 2, 'and': 1, 'all': 1, 'finditer': 1,
'iterator': 1, 'over': 1, 'an': 1, 'instances': 1,
'scanned': 1, 'right': 1, 'RE': 1, 'another': 1, 'touch': 1,
'New': 1, 'to': 1, 'returned': 1, 'Return': 1, 'for': 1,
'0': 1, 're': 1, 'version': 1, 'Empty': 1, 'is': 1,
'match': 1, 'non': 1, 'unless': 1, 'overlapping': 1, 'they': 1,
'included': 1, 'The': 1, 'beginning': 1, 'MatchObject': 1,
'result': 1, 'of': 1, 'yielding': 1, 'flags': 1, 'found': 1,
'order': 1, 'left': 1})

data:

re.finditer(pattern, string, flags=0) Return an iterator yielding MatchObject instances over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match. New in version 2.2.
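
Applied to the directory walk from the question, the same idea might look like the sketch below. It is only an illustration under some assumptions: the helper name `count_words`, the lowercasing step, and the throwaway demo directory are not part of the original answer, and the code is written for Python 3. Note that `\w+` also splits contractions and hyphenated words, so "don't" becomes `don` and `t`.

```python
import os
import re
import tempfile
from collections import Counter

WORD_RE = re.compile(r"\w+")

def count_words(top):
    """Return a Counter of the words found in every file under `top`."""
    counts = Counter()
    for root, dirs, files in os.walk(top):
        for name in files:
            with open(os.path.join(root, name)) as fh:
                # Lowercasing is an assumption: it merges 'Shake' and 'shake'.
                counts.update(WORD_RE.findall(fh.read().lower()))
    return counts

# Demo on a temporary directory (hypothetical data, not the asker's files).
demo_dir = tempfile.mkdtemp()
with open(os.path.join(demo_dir, "a.txt"), "w") as fh:
    fh.write("Shake! shake, shake? Spear.")

counts = count_words(demo_dir)
total_words = sum(counts.values())   # every occurrence
unique_words = len(counts)           # distinct words, punctuation ignored
print(total_words, unique_words)
```

Because the `Counter` holds one key per distinct word, `len(counts)` replaces `len(uniquewords)` from the question, and `sum(counts.values())` replaces the separate total count.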

tzelleke
  • It doesn't matter in this case, but in general [Counter is a slow option](http://stackoverflow.com/a/2525617/4279). You could read line by line to save memory for large files: `Counter(word for line in f for word in re.findall(r'\w+', line))` – jfs Aug 11 '12 at 12:35
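
The line-by-line variant from the comment above could be wrapped up roughly as follows. This is a sketch for Python 3; the function name `count_words_by_line` and the in-memory sample text are assumptions made for illustration. Since `\w` never matches a newline, no word can span two lines, so the counts should match those from reading the whole file at once.

```python
import io
import re
from collections import Counter

def count_words_by_line(lines):
    """Count words from any iterable of lines, such as an open file."""
    # The generator expression hands words to Counter one line at a time,
    # so the full text is never held in memory.
    return Counter(word for line in lines for word in re.findall(r"\w+", line))

# An in-memory stand-in for an open file (hypothetical sample text).
sample = io.StringIO("shake, shake!\nShake? spear\n")
line_counts = count_words_by_line(sample)
print(line_counts)
```

Without lowercasing, `Shake` and `shake` are kept as separate keys here.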