Removing punctuation from list items using Python

Question

from glob import glob
pattern = "D:\\report\\shakeall\\*.txt"
filelist = glob(pattern)
def countwords(fp):
    with open(fp) as fh:
        return len(fh.read().split())
print "There are" ,sum(map(countwords, filelist)), "words in the files. " "From directory",pattern
import os
import re
import string
uniquewords = set([])
for root, dirs, files in os.walk("D:\\report\\shakeall"):
    for name in files:
        [uniquewords.add(x) for x in open(os.path.join(root,name)).read().split()]
wordlist = list(uniquewords)

This code counts the total number of words and collects the unique words. The problem is that len(uniquewords) returns an unreasonably large number, because it treats, for example, 'shake', 'shake!', 'shake,' and 'shake?' as different unique words. I've tried to remove the punctuation from uniquewords by converting it to a list and modifying it, but everything I tried failed. Can anybody help me?

[How to format code on SO](http://meta.stackexchange.com/questions/22186/how-do-i-format-my-code-blocks) — Levon, Aug 11 '12 at 11:25
Is there a reason why `"words in the files. " "From directory"` isn't simply `"words in the files. From directory"`? — Levon, Aug 11 '12 at 11:29
http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python — NIlesh Sharma, Aug 11 '12 at 11:30

tzelleke · Answer 1 · 2012-08-11T12:00:43.757

Use a regex with the \w+ pattern to match words and exclude punctuation.
When counting in Python, use collections.Counter.

The example data for this code is appended at the end:

import re
from collections import Counter

pattern = re.compile(r'\w+')

with open('data') as f:
    text = f.read()

print Counter(pattern.findall(text))

gives:

Counter(
{'in': 4, 'the': 4, 'string': 3, 'matches': 3, 'are': 2,
'pattern': 2, '2': 2, 'and': 1, 'all': 1, 'finditer': 1,
'iterator': 1, 'over': 1, 'an': 1, 'instances': 1,
'scanned': 1, 'right': 1, 'RE': 1, 'another': 1, 'touch': 1,
'New': 1, 'to': 1, 'returned': 1, 'Return': 1, 'for': 1,
'0': 1, 're': 1, 'version': 1, 'Empty': 1, 'is': 1,
'match': 1, 'non': 1, 'unless': 1, 'overlapping': 1, 'they': 1,
'included': 1, 'The': 1, 'beginning': 1, 'MatchObject': 1,
'result': 1, 'of': 1, 'yielding': 1, 'flags': 1, 'found': 1,
'order': 1, 'left': 1})
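Note that \w+ splits on every non-word character, which is why the output above lists 'left', 'to' and 'right' as separate entries and counts '2' twice (from "2.2"). If hyphenated or apostrophized words should stay whole, one possible variant is sketched below. The pattern and the name HYPHEN_WORD_RE are my own assumptions, not part of the answer, and the snippet uses Python 3 syntax.

```python
import re

# Assumed variant of the answer's pattern: a run of word characters,
# optionally followed by more runs joined by a single hyphen or apostrophe.
HYPHEN_WORD_RE = re.compile(r"\w+(?:[-']\w+)*")

text = "left-to-right, non-overlapping; version 2.2."
tokens = HYPHEN_WORD_RE.findall(text)
print(tokens)
```

The dot in "2.2" still splits the number, because only hyphens and apostrophes are treated as joiners here.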

data:

re.finditer(pattern, string, flags=0) Return an iterator yielding MatchObject instances over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match. New in version 2.2.
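Applied to the example from the question, the same approach might look like the sketch below. It is written in Python 3 syntax (the code above is Python 2), and the names WORD_RE, count_words and sample are only illustrative. The keys of the Counter are the unique words, so no separate set is needed.

```python
import re
from collections import Counter

WORD_RE = re.compile(r'\w+')

def count_words(text):
    """Return a Counter of the word tokens found in text."""
    return Counter(WORD_RE.findall(text))

# The variants from the question, plus one capitalized form.
sample = "shake shake! shake, shake? Shake"
counts = count_words(sample)

unique_words = set(counts)    # punctuation no longer creates extra entries
print(len(unique_words))      # number of unique words
print(sum(counts.values()))   # total number of words
```

'shake' and 'Shake' are still counted as different words here. If that is not wanted, lowercasing the text before matching (text.lower()) is one option.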

It doesn't matter in this case but in general [Counter is a slow option](http://stackoverflow.com/a/2525617/4279). You could read line-by-line to preserve memory for large files: `Counter(word for line in f for word in re.findall(r'\w+', line))` — jfs, Aug 11 '12 at 12:35
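The line-by-line suggestion could be combined with the directory walk from the question roughly as follows. This is a sketch in Python 3 syntax. The function name count_words_in_tree and the .txt filter are assumptions on my part (the question's glob uses *.txt, but its os.walk loop reads every file), and the demo writes to a temporary directory instead of D:\report\shakeall so that it runs anywhere.

```python
import os
import re
import tempfile
from collections import Counter

WORD_RE = re.compile(r'\w+')

def count_words_in_tree(root):
    """Count word tokens in every .txt file under root, one line at a time."""
    counts = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith('.txt'):
                continue  # assumed filter, mirroring the *.txt glob
            with open(os.path.join(dirpath, name)) as fh:
                for line in fh:
                    counts.update(WORD_RE.findall(line))
    return counts

# Demo on a throwaway directory with two small files.
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, 'a.txt'), 'w') as fh:
        fh.write("shake, shake!\nrattle and roll\n")
    with open(os.path.join(tmp, 'b.txt'), 'w') as fh:
        fh.write("shake? roll.\n")
    counts = count_words_in_tree(tmp)

print(len(counts))           # unique words across all files
print(sum(counts.values()))  # total words across all files
```

Only one line is held in memory at a time, so the size of an individual file matters less than with fh.read().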
