0

I know there are tons of examples about removing punctuation, but I want to know the most efficient way to do this. I have a list of words that I read from a txt file and split:

wordlist = open('Tyger.txt', 'r').read().split()

What is the fastest way to check each word and remove any punctuation? I can do it with a bunch of code, but I know that is not the easiest way.

Thanks!!

English Grad
  • Can you provide a sample input and output (or delineate what makes up your set of punctuation)? – Levon Jun 07 '12 at 16:14
  • Sure, no problem. The text file is a poem. The first two lines read: Tyger! Tyger! burning bright In the forests of the night, I would like them to end up in the list with no commas or exclamation marks. The set of punctuation I need removed is "-,!?. Thanks! – English Grad Jun 07 '12 at 16:15
  • looks like a duplicate to this http://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string-in-python – Joran Beasley Jun 07 '12 at 16:16
  • @JoranBeasley: I don't think this is a dup. My answer fits this question, but not the other one. – Sven Marnach Jun 07 '12 at 16:17

4 Answers

2

I think the easiest way is to extract only the words in the first place, i.e. runs of letters, digits, and underscores:

import re

with open("Tyger.txt") as f:
    words = re.findall(r"\w+", f.read())
Sven Marnach
  • how would this deal with special chars which are not punctuation? – luke14free Jun 07 '12 at 16:21
  • This works great thanks. I really appreciate the help. I learn so much from all of you guys – English Grad Jun 07 '12 at 16:21
  • 1
    @EnglishGrad: Note Sven's use of the `with` keyword to open the input file. Using a `with` block is preferred to using `f = open()... close()` and _much_ preferred to using `stuff = open().read()...`. In the last example you lose the ability to explicitly `close()` the file after reading/writing. – Joel Cornett Jun 07 '12 at 16:28
  • @luke14free: By providing the `re.LOCALE` or `re.UNICODE` flag and setting the locale, you can make this perform as desired. For standard strings without any flags, it only matches the set `[a-zA-Z0-9_]`. See the [documentation](http://docs.python.org/library/re.html#regular-expression-syntax) for further details. – Sven Marnach Jun 07 '12 at 16:41
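
To make the point about flags concrete, here is a minimal sketch assuming Python 3, where the defaults differ from the Python 2 behaviour described in the comment above: `str` patterns are Unicode-aware unless `re.ASCII` is passed. The sample strings are illustrative and not taken from the poem file.

```python
import re

line = "Tyger! Tyger! burning bright"
accented = "na\u00efve caf\u00e9, d\u00e9j\u00e0 vu!"  # illustrative non-ASCII input

# Python 3 str patterns: \w matches Unicode letters, digits and "_"
words = re.findall(r"\w+", line)
unicode_words = re.findall(r"\w+", accented)

# re.ASCII limits \w to [a-zA-Z0-9_], so accented letters break words apart
ascii_words = re.findall(r"\w+", accented, flags=re.ASCII)

print(words)
print(unicode_words)
print(ascii_words)
```

Whether accented letters should count as part of a word depends on the text being processed, so the flag is a choice rather than a fix.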
1

For example:

text = """
Tyger! Tyger! burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry? 
"""
import re
words = re.findall(r'\w+', text)

or

import string
ps = string.punctuation
# Map every punctuation character to a space, then split on whitespace
words = text.translate(str.maketrans(ps, ' ' * len(ps))).split()

The second one is much faster.
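
Timings depend on the Python version and on the input, so it may be worth measuring on your own data. The following is a rough sketch assuming Python 3; the helper names `with_regex` and `with_translate` and the repeated input `big` are made up for this illustration.

```python
import re
import string
import timeit

text = """
Tyger! Tyger! burning bright
In the forests of the night,
What immortal hand or eye
Could frame thy fearful symmetry?
"""

ps = string.punctuation
table = str.maketrans(ps, " " * len(ps))  # built once, reused on every call


def with_regex(s):
    # Collect runs of word characters
    return re.findall(r"\w+", s)


def with_translate(s):
    # Turn punctuation into spaces, then split on whitespace
    return s.translate(table).split()


# The two approaches agree on this sample
same_result = with_regex(text) == with_translate(text)
print(same_result)

big = text * 1000  # hypothetical larger input, only for timing
regex_time = timeit.timeit(lambda: with_regex(big), number=20)
translate_time = timeit.timeit(lambda: with_translate(big), number=20)
print(f"regex:     {regex_time:.4f}s")
print(f"translate: {translate_time:.4f}s")
```

The two are not interchangeable on every input: `string.punctuation` contains the underscore, so the `translate` version splits `a_b` into two words while `\w+` keeps it as one.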

georg
1

I would go with something like this:

import re
with open("Tyger.txt") as f:
    print(" ".join(re.split(r"[-,!?.]", f.read())))

It removes only the characters that actually need removing and won't add overhead by overmatching.
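
If the words are already split into a list, as in the question, the same idea (touch only the characters that need removing) can also be sketched without a regex. This assumes Python 3 and uses the character set named in the question's comments; the sample line stands in for the file contents.

```python
# Characters to strip, taken from the set named in the question's comments
unwanted = '"-,!?.'
table = str.maketrans("", "", unwanted)  # third argument lists characters to delete

sample = "Tyger! Tyger! burning bright In the forests of the night,"
wordlist = sample.split()

# Strip the unwanted characters from each word, then drop words left empty
cleaned = [w.translate(table) for w in wordlist]
cleaned = [w for w in cleaned if w]
print(cleaned)
```

Deleting rather than replacing with a space means a hyphenated word such as `deep-set` would come out as `deepset`; whether that is wanted depends on the text.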

luke14free
1
>>> import re

>>> the_tyger
'\n    Tyger! Tyger! burning bright \n    In the forests of the night, \n    What immortal hand or eye \n    Could frame thy fearful symmetry? \n    \n    In what distant deeps or skies \n    Burnt the fire of thine eyes? \n    On what wings dare he aspire? \n    What the hand dare sieze the fire? \n    \n    And what shoulder, & what art. \n    Could twist the sinews of thy heart? \n    And when thy heart began to beat, \n    What dread hand? & what dread feet? \n    \n    What the hammer? what the chain? \n    In what furnace was thy brain? \n    What the anvil? what dread grasp \n    Dare its deadly terrors clasp? \n    \n    When the stars threw down their spears, \n    And watered heaven with their tears, \n    Did he smile his work to see? \n    Did he who made the Lamb make thee? \n    \n    Tyger! Tyger! burning bright \n    In the forests of the night, \n    What immortal hand or eye \n    Dare frame thy fearful symmetry? \n    '

>>> print(re.sub(r'[-",!?.]', '', the_tyger))

Prints:

Tyger Tyger burning bright 
In the forests of the night 
What immortal hand or eye 
Could frame thy fearful symmetry 

In what distant deeps or skies 
Burnt the fire of thine eyes 
On what wings dare he aspire 
What the hand dare sieze the fire 

And what shoulder & what art 
Could twist the sinews of thy heart 
And when thy heart began to beat 
What dread hand & what dread feet 

What the hammer what the chain 
In what furnace was thy brain 
What the anvil what dread grasp 
Dare its deadly terrors clasp 

When the stars threw down their spears 
And watered heaven with their tears 
Did he smile his work to see 
Did he who made the Lamb make thee 

Tyger Tyger burning bright 
In the forests of the night 
What immortal hand or eye 
Dare frame thy fearful symmetry 

Or, with a file:

>>> with open('tyger.txt', 'r') as WmBlake:
...    print(re.sub(r'[-",!?.]', '', WmBlake.read()))

And if you want to create a list of the lines:

>>> with open('tyger.txt', 'r') as WmBlake:
...    lines = re.sub(r'[-",!?.]', '', WmBlake.read()).splitlines()
the wolf
  • 1
    +1 for posting the complete poem ;) Although it looks more like Bukowski now than Blake. – georg Jun 07 '12 at 16:41
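
One detail worth keeping in mind with character classes like the ones in this thread: a hyphen between two characters inside `[...]` defines a range, so its position matters. A small sketch, assuming Python 3, with a made-up sample string:

```python
import re

sample = "rock-and-roll & blues, yes!"

# '"-,' is read as the range from '"' to ',', which covers " # $ % & ' ( ) * + ,
# so '&' is removed and the literal hyphen is left alone.
as_range = re.sub(r'["-,!?.]', "", sample)

# With the hyphen placed first, it is a literal member of the set.
as_literal = re.sub(r'[-",!?.]', "", sample)

print(repr(as_range))
print(repr(as_literal))
```

Placing the hyphen first or last in the class, or escaping it, keeps it literal.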