Remove punctuation from a list

Question

I'm working on taking a sample of the Declaration of Independence and calculating the frequency of the length of words in it.

Sample text from file:

"When in the Course of human events it becomes necessary for one people to dissolve the political bands which have connected them with another and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires 
that they should declare the causes which impel them to the separation."

Note: The word length must not include any punctuation, i.e. anything from string.punctuation.

Expected Outcome (sample):

Length Count
1 16
2 267
3 267
4 169
5 140
6 112
7 99
8 68
9 61
10 56
11 35
12 13
13 9
14 7
15 2

I'm currently stuck on removing punctuation from the file that I've converted into a list.

Here is what I've tried so far:

import sys
import string

def format_text(fname):
        punc = set(string.punctuation)
        words = fname.read().split()
        return ''.join(word for word in words if word not in punc)

try:
    with open(sys.argv[1], 'r') as file_arg:
        file_arg.read()
except IndexError:
    print('You need to provide a filename as an argument.')
    sys.exit()

fname = open(sys.argv[1], 'r')
formatted_text = format_text(fname)
print(formatted_text)

And what exactly is the issue? — tomasyany, Jun 21 '15 at 23:33

Padraic Cunningham · Answer 1 · 2015-06-22T00:51:38.783

You can strip the punctuation from each word, and avoid reading the whole file into memory, by iterating over the file object inside format_text:

punc = string.punctuation
return ' '.join(word.strip(punc) for line in fname for word in line.split())
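
To see why stripping each word helps, here is a small sketch that compares it with the membership test from the question. The sample string is made up for illustration.

```python
import string

punc = string.punctuation
# Made-up sample text, not taken from the file in the question.
words = 'the earth, the "Laws" of Nature -'.split()

# The question's test compares each whole word against single punctuation
# characters, so only tokens that are exactly one punctuation mark are dropped.
kept = [word for word in words if word not in set(punc)]
print(kept)

# Stripping removes punctuation from both ends of every word instead.
stripped = [word.strip(punc) for word in words]
print(stripped)
```

Note that a token made up only of punctuation becomes an empty string after stripping, so you may want to filter those out before counting.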

strip only removes punctuation from the ends of a word. If you also want to remove the ' inside Nature's, you will need translate:

from string import punctuation

# use ord of characters you want to replace as keys and what you want to replace them with as values
tbl = {ord(k):"" for k in punctuation}
return ' '.join(line.translate(tbl) for line in fname)
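
A short sketch of the difference between the two approaches on individual words, assuming Python 3 (where str.translate accepts a dict keyed by code point). The sample words are chosen for illustration.

```python
from string import punctuation

# Map every punctuation code point to "" so translate() deletes it.
tbl = {ord(ch): "" for ch in punctuation}

for word in ["earth,", "Nature's", '"When']:
    # strip() only touches the ends; translate() removes punctuation anywhere.
    print(word, "->", word.strip(punctuation), "|", word.translate(tbl))
```

With strip, Nature's keeps its apostrophe and counts as 8 characters; with translate it becomes Natures and counts as 7.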

To get the frequency, use a Counter dict:

from collections import Counter
freq = Counter(len(word.translate(tbl)) for line in fname for word in line.split())

Or depending on your approach:

freq = Counter(len(word.strip(punc)) for line in fname for word in line.split())

Using the lines in your question above as an example:

lines =""""When in the Course of human events it becomes necessary for one people to dissolve the political bands which have connected them with another and to assume among the powers of the earth, the separate and equal station to which the Laws of Nature and of Nature's God entitle them, a decent respect to the opinions of mankind requires
that they should declare the causes which impel them to the separation."""

from collections import Counter
freq = Counter(len(word.strip(punctuation)) for line in lines.splitlines() for word in line.split())
print(freq.most_common())

This outputs (length, frequency) tuples, ordered from the most common word length down to the least common:

[(3, 15), (2, 12), (4, 9), (5, 9), (6, 9), (7, 7), (8, 5), (9, 3), (1, 1), (10, 1)]
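
As a small illustration of how most_common orders its result compared with sorting by key, here is a sketch on a made-up list of word lengths:

```python
from collections import Counter

# Made-up word lengths, purely for illustration.
freq = Counter([3, 2, 3, 4, 2, 3])

by_count = freq.most_common()      # highest count first
top_one = freq.most_common(1)      # only the most frequent length
by_length = sorted(freq.items())   # ordered by word length instead

print(by_count)
print(top_one)
print(by_length)
```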

To print the frequencies in order of word length, starting from one-letter words, without sorting:

# The keys are the word lengths, so the largest key is the longest word.
mx = max(freq)
for i in range(1, mx + 1):
    v = freq[i]
    if v:
        print("length {} words appeared {} time/s.".format(i, v))

Output:

length 1 words appeared 1 time/s.
length 2 words appeared 12 time/s.
length 3 words appeared 15 time/s.
length 4 words appeared 9 time/s.
length 5 words appeared 9 time/s.
length 6 words appeared 9 time/s.
length 7 words appeared 7 time/s.
length 8 words appeared 5 time/s.
length 9 words appeared 3 time/s.
length 10 words appeared 1 time/s.

Unlike a normal dict, a Counter does not raise a KeyError for a missing key; it returns 0. So if v is only true for word lengths that actually appeared in the file.
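
A minimal sketch of that behaviour, using made-up counts:

```python
from collections import Counter

freq = Counter({2: 12, 3: 15})
plain = dict(freq)

print(freq[7])      # missing key: a Counter returns 0
print(7 in freq)    # the lookup does not add the key

try:
    plain[7]
except KeyError:
    print("a plain dict raises KeyError")
```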

If you also want to print the cleaned text, with all the logic in functions:

import sys
import string
from collections import Counter


def clean_text(fname):
    """Return every word in the file with surrounding punctuation stripped."""
    punc = string.punctuation
    return [word.strip(punc) for line in fname for word in line.split()]


def get_freq(cleaned):
    """Count how many words there are of each length."""
    return Counter(len(word) for word in cleaned)


def freq_output(d):
    # Loop up to the longest word length (the largest key, not the largest count).
    mx = max(d)
    for i in range(1, mx + 1):
        v = d[i]
        if v:
            print("length {} words appeared {} time/s.".format(i, v))


try:
    path = sys.argv[1]
except IndexError:
    print('You need to provide a filename as an argument.')
    sys.exit()

# The with block closes the file once the words have been read.
with open(path, 'r') as fname:
    formatted_text = clean_text(fname)

print(" ".join(formatted_text))
print()
freq = get_freq(formatted_text)

freq_output(freq)

Run on the snippet from your question, this outputs:

~$ python test.py test.txt
When in the Course of human events it becomes necessary for one people  
to dissolve the political bands which have connected them with another
and to assume among the powers of the earth the separate and equal station 
 to which the Laws of Nature and of Nature's God entitle them a decent 
respect to the opinions of mankind requires that they should declare 
the causes which impel them to the separation

length 1 words appeared 1 time/s.
length 2 words appeared 12 time/s.
length 3 words appeared 15 time/s.
length 4 words appeared 9 time/s.
length 5 words appeared 9 time/s.
length 6 words appeared 9 time/s.
length 7 words appeared 7 time/s.
length 8 words appeared 5 time/s.
length 9 words appeared 3 time/s.
length 10 words appeared 1 time/s.
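
If you would rather match the "Length Count" layout from the question, a small sketch follows. format_table is a hypothetical helper name, not part of the code above, and the counts are made up.

```python
from collections import Counter


def format_table(freq):
    """Return 'Length Count' rows sorted by word length (hypothetical helper)."""
    rows = ["Length Count"]
    rows.extend("{} {}".format(length, count)
                for length, count in sorted(freq.items()))
    return "\n".join(rows)


# Made-up counts for illustration.
freq = Counter({3: 15, 2: 12, 1: 1, 10: 1})
print(format_table(freq))
```

Sorting the items only visits lengths that actually occurred, so no check for zero counts is needed.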

If you only care about the frequency output, do it all in one pass:

import sys
from collections import Counter
from string import punctuation


def freq_output(fname):
    tbl = {ord(k): "" for k in punctuation}
    # Use only one of the two lines below: the file can be iterated just once.
    d = Counter(len(word.strip(punctuation)) for line in fname for word in line.split())
    # d = Counter(len(word.translate(tbl)) for line in fname for word in line.split())
    # Loop up to the longest word length (the largest key, not the largest count).
    mx = max(d)
    for i in range(1, mx + 1):
        v = d[i]
        if v:
            print("length {} words appeared {} time/s.".format(i, v))


try:
    path = sys.argv[1]
except IndexError:
    print('You need to provide a filename as an argument.')
    sys.exit()

with open(path, 'r') as fname:
    freq_output(fname)

Keep whichever of the two d assignments matches how you want to treat punctuation inside words.
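
Only one of the two Counter expressions can be used per pass, because a file object is consumed as it is iterated. A small sketch of that, with io.StringIO standing in for an open text file (an assumption made for the demo):

```python
import io
from collections import Counter

# io.StringIO stands in for an open text file here.
fname = io.StringIO("When in the Course\nof human events\n")

first = Counter(len(word) for line in fname for word in line.split())
second = Counter(len(word) for line in fname for word in line.split())
print(first)
print(second)   # empty: the read position is already at the end

fname.seek(0)   # rewind to read the same object again
third = Counter(len(word) for line in fname for word in line.split())
print(third == first)
```

Calling max() on the empty second Counter would raise a ValueError, which is why building d twice from the same file object fails.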

@Jay_R, no worries, you can iterate over the file object and split, there is no point reading all the content into memory unless you want to use it. — Padraic Cunningham, Jun 21 '15 at 23:47
How would you approach the next part then? Calculating the frequency of the length of words. Dictionary? List? — Jay Py, Jun 21 '15 at 23:58
I'm a bit confused by where I'd implement that code. The next step I was taking was another function to take the output of the format_text function and calculating word length and frequency of it in another function. — Jay Py, Jun 22 '15 at 00:26
Do you actually care about the text at all? Do you just want the frequency or will you use the cleaned text somewhere else? — Padraic Cunningham, Jun 22 '15 at 00:28
@Jay_R, well I added how to piece it all together, if you just want the frequency then the Counter dict logic is all you need, passing the length of each cleaned word in the generator expression — Padraic Cunningham, Jun 22 '15 at 00:39
@Jay_R, no worries, how you output the data is up to yourself but a Counter dict is the way to go. — Padraic Cunningham, Jun 22 '15 at 00:57

score 2 · Answer 2 · edited May 23 '17 at 11:44

In Python 2, you can use translate to strip the punctuation:

import string

words = fname.read().translate(None, string.punctuation).split()

See also: Best way to strip punctuation from a string in Python

py2.7:

import string
from collections import defaultdict
from collections import Counter

def s1():
    with open("myfile.txt", "r") as f:
        counts = defaultdict(int)
        for line in f:
            words = line.translate(None, string.punctuation).split()
            for length in map(len, words):
                counts[length] += 1
    return counts

def s2():
    with open("myfile.txt", "r") as f:
        counts = Counter(length for line in f for length in map(len, line.translate(None, string.punctuation).split()))
    return counts

print s1()
defaultdict(<type 'int'>, {1: 111, 2: 1169, 3: 1100, 4: 1470, 5: 1425, 6: 1318, 7: 1107, 8: 875, 9: 938, 10: 108, 11: 233, 12: 146})

print s2()
Counter({4: 1470, 5: 1425, 6: 1318, 2: 1169, 7: 1107, 3: 1100, 9: 938, 8: 875, 11: 233, 12: 146, 1: 111, 10: 108})

In Python 2.7, using Counter is slower than building up a dictionary manually because of the way Counter's update is implemented.

%timeit s1()
100 loops, best of 3: 4.42 ms per loop

%timeit s2()
100 loops, best of 3: 9.27 ms per loop

py3:

I think Counter was updated in Python 3.2 and became as fast as, or faster than, building the counter dictionary manually.

Python 3's translate also changed: it takes a single translation table, which str.maketrans can build:

import string
from collections import defaultdict
from collections import Counter

strip_punct = str.maketrans('','',string.punctuation)

def s1():
    with open("myfile.txt", "r") as f:
        counts = defaultdict(int)
        for line in f:
            words = line.translate(strip_punct).split()
            for length in map(len, words):
                counts[length] += 1
    return counts

def s2():
    with open("myfile.txt", "r") as f:
        counts = Counter(length for line in f for length in map(len, line.translate(strip_punct).split()))
    return counts

print(s1())
defaultdict(<class 'int'>, {1: 111, 2: 1169, 3: 1100, 4: 1470, 5: 1425, 6: 1318, 7: 1107, 8: 875, 9: 938, 10: 108, 11: 233, 12: 146})

print(s2())
Counter({4: 1470, 5: 1425, 6: 1318, 2: 1169, 7: 1107, 3: 1100, 9: 938, 8: 875, 11: 233, 12: 146, 1: 111, 10: 108})

%timeit s1()
100 loops, best of 3: 11.4 ms per loop

%timeit s2()
100 loops, best of 3: 11.2 ms per loop
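
Outside IPython, a comparison along the same lines can be sketched with the timeit module. The in-memory LINES list is made up and stands in for myfile.txt, the function names are hypothetical, and the timings will differ by machine and Python version.

```python
import string
import timeit
from collections import Counter, defaultdict

strip_punct = str.maketrans("", "", string.punctuation)
# Made-up in-memory lines standing in for myfile.txt.
LINES = ["When in the Course of human events, it becomes necessary."] * 1000


def count_defaultdict(lines):
    counts = defaultdict(int)
    for line in lines:
        for word in line.translate(strip_punct).split():
            counts[len(word)] += 1
    return counts


def count_counter(lines):
    return Counter(len(word) for line in lines
                   for word in line.translate(strip_punct).split())


# Both approaches must agree before their timings are worth comparing.
assert dict(count_defaultdict(LINES)) == dict(count_counter(LINES))

for func in (count_defaultdict, count_counter):
    seconds = timeit.timeit(lambda: func(LINES), number=20)
    print("{}: {:.4f} s for 20 runs".format(func.__name__, seconds))
```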

Can you add a link to where the docs say Counter is slower than building a dict manually? — Padraic Cunningham, Jun 22 '15 at 00:55
Here is a blog post about it. http://katrinaeg.com/python-counter-performance.html. Also this so question mentions it http://stackoverflow.com/questions/27801945/surprising-results-with-python-timeit-counter-vs-defaultdict-vs-dict. Sorry took me a while to find. — dting, Jun 22 '15 at 01:06
Interesting. I suppose the bells and whistles a Counter provides offsets any overhead. Still surprising that it should not have always been faster considering its exact purpose is to count — Padraic Cunningham, Jun 22 '15 at 01:20

score 0 · Answer 3 · answered Jun 22 '15 at 00:05

You can use regular expressions:

import re

def format_text(fname, pattern):
    # Read the whole file and delete every character the pattern matches.
    words = fname.read()
    return re.sub(pattern, '', words)

p = re.compile(r'[!&:;",.]')
with open('C:/Projects/ExplorePy/test.txt') as fh:
    text = format_text(fh, p)

Apply split() to the result as needed; the pattern can be extended to cover more punctuation.
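
One possible refinement, sketched here as an assumption rather than as part of the answer, is to build the character class from string.punctuation with re.escape so that every punctuation character is covered. PUNCT_RE and remove_punctuation are hypothetical names.

```python
import re
import string

# Character class matching any single character from string.punctuation.
# re.escape protects characters such as ], ^, - and \ inside the brackets.
PUNCT_RE = re.compile("[" + re.escape(string.punctuation) + "]")


def remove_punctuation(text):
    """Delete every punctuation character from text."""
    return PUNCT_RE.sub("", text)


words = remove_punctuation("of the earth, and of Nature's God!").split()
print(words)
print([len(word) for word in words])
```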
