
This question (Best way to strip punctuation from a string in Python) deals with stripping punctuation from an individual string. However, I want to read text from an input file and print only ONE COPY of each string, ignoring any trailing punctuation. I have started with something like this:

f = open('#file name ...', 'a+')
for x in set(f.read().split()):
    print x

But the problem is that if the input file has, for instance, this line:

This is not is, clearly is: weird

It treats "is", "is," and "is:" as three different strings, but I want to ignore the punctuation and print "is" only once rather than three times. How do I remove any trailing punctuation before putting the resulting strings in the set?

Thanks for any help. (I am really new to Python.)

user16647
  • Are you sure you want to open the file in `a+` mode? `r` should be enough. – Matthias Jun 22 '12 at 14:47
  • You're correct that `r` is enough. However, I'm hoping to append to the file later, so I put `a+` there for future use. – user16647 Jun 22 '12 at 14:48

2 Answers

import re

for x in set(re.findall(r'\b\w+\b', f.read())):
    print x

This should distinguish words more reliably than splitting on whitespace.

This regular expression finds contiguous runs of word characters (a-z, A-Z, 0-9, _).

If you want to find letters only (no digits and no underscore), then replace the \w with [a-zA-Z].

>>> re.findall(r'\b\w+\b', "This is not is, clearly is: weird")
['This', 'is', 'not', 'is', 'clearly', 'is', 'weird']
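
As a rough sketch of how the pieces above fit together, the regex can be wrapped in a small helper that returns the set of distinct words. The names `WORD_PATTERN` and `unique_words` are made up for illustration, and the snippet is written for Python 3 (so `print` is a function and `\w` also matches non-ASCII letters), which is an assumption beyond the Python 2 code in this thread.

```python
import re

# Hypothetical names; the pattern itself is the one shown in the answer.
WORD_PATTERN = re.compile(r'\b\w+\b')


def unique_words(text):
    """Return the set of distinct words in text, ignoring punctuation."""
    return set(WORD_PATTERN.findall(text))


sample = "This is not is, clearly is: weird"
# Sorting is only for a stable display order; sets themselves are unordered.
for word in sorted(unique_words(sample)):
    print(word)
```

To use it on a file, pass the result of `f.read()` instead of `sample`. If the file is opened in `a+` mode, the read position may start at the end of the file (it does in Python 3), so a call to `f.seek(0)` before reading is probably needed.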
eumiro

You can use a translation table if you don't mind replacing your punctuation characters with whitespace, for example:

>>> from string import maketrans
>>> punctuation = ",;.:"
>>> replacement = "    "
>>> trans_table = maketrans(punctuation, replacement)
>>> 'This is not is, clearly is: weird'.translate(trans_table)
'This is not is  clearly is  weird'
# And for your case of creating a set of unique words.
>>> set('This is not is  clearly is  weird'.split())
set(['This', 'not', 'is', 'clearly', 'weird'])
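
For readers on Python 3, where `string.maketrans` no longer exists, the same idea can be sketched with `str.maketrans`. The function name `unique_words_translate` and its default punctuation string are illustrative assumptions, not part of the session above.

```python
def unique_words_translate(text, punctuation=",;.:"):
    """Replace each punctuation character with a space, then build a set."""
    # str.maketrans is the Python 3 counterpart of string.maketrans;
    # both arguments must have the same length.
    table = str.maketrans(punctuation, " " * len(punctuation))
    return set(text.translate(table).split())


print(unique_words_translate("This is not is, clearly is: weird"))
```

Only the characters listed in `punctuation` are handled, so anything else (for example `!` or `?`) would still stick to its word unless it is added to that string.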
Christian Witts