1

I have a string and I'm trying to remove all characters that are neither alphanumeric nor in this set:

'''!$%*()_-=+\/.,><:;'"?|'''.

I know this removes every character that is not alphanumeric or an underscore, but how can I also keep the characters in the set above?

re.sub(r'\W+','',line)
Youcha

3 Answers

7

A Python 2.x non-regex solution:

import string

punctuation = '''!$%*()_-=+\/.,><:;'"?|'''
allowed = string.digits + string.letters + punctuation
filter(allowed.__contains__, s)  # on a Python 2 str, filter() returns a str

The string to filter is s. (This probably isn't the fastest solution for long strings.)
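For Python 3, where string.letters no longer exists and filter() returns an iterator, a rough equivalent might look like the sketch below. The names keep_allowed, PUNCTUATION and ALLOWED are only illustrative, and string.ascii_letters covers ASCII letters only.

```python
import string

# Raw string so the backslash in the set is taken literally.
PUNCTUATION = r'''!$%*()_-=+\/.,><:;'"?|'''
# A frozenset gives constant-time membership tests.
ALLOWED = frozenset(string.ascii_letters + string.digits + PUNCTUATION)

def keep_allowed(s):
    """Return s with every character outside ALLOWED dropped."""
    return ''.join(c for c in s if c in ALLOWED)

print(keep_allowed("Hello, World! #42 @home ~ [ok]"))  # Hello,World!42homeok
```

Note that the space is not in the set, so spaces are dropped as well.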

Sven Marnach
4

With credit to this thread: Remove specific characters from a string in python

First, there's no need to retype all the punctuation manually. The string module defines string.punctuation as a constant for your convenience. (Use help(string) to see the other similar definitions available.)

>>> import string
>>> string.punctuation
'!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~'

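Applying this takes some fiddling, because you have to define the undesired characters yourself: in this form, translate only removes the characters you tell it to remove. Note also that string.punctuation is a superset of the set in the question, so swap in your own set if you need an exact match. If you're sure your text is 100% ASCII characters, then in Python 2.x you could define: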

>>> delchars = ''.join(c for c in map(chr, range(256)) if c not in (string.punctuation + string.digits + string.letters) )

You can then filter the text by deleting those characters (in Python 2.x, str.translate takes the characters to delete as its second argument):

>>> text.translate(None, delchars)
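If the text is not guaranteed to be ASCII, one possible Python 3 variant is a translation table that deletes anything it does not know about. This is only a sketch: the class name KeepOnly is made up for illustration, and it relies on str.translate dropping characters that map to None.

```python
import string

class KeepOnly(dict):
    """Translation table that deletes any code point not stored in it."""
    def __missing__(self, codepoint):
        return None  # None tells str.translate to drop the character

# The set from the question, as a raw string so the backslash is literal.
PUNCTUATION = r'''!$%*()_-=+\/.,><:;'"?|'''
allowed = string.ascii_letters + string.digits + PUNCTUATION
# Map each allowed code point to itself; everything else hits __missing__.
table = KeepOnly((ord(c), c) for c in allowed)

text = "caf\u00e9 #1: 50% off!"
print(text.translate(table))  # caf1:50%off!
```

Because the table names what to keep rather than what to delete, there is no need to enumerate every unwanted character up front.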

EDIT: Here's some interesting timing information for the various methods: Stripping everything but alphanumeric chars from a string in Python

abought
  • While `str.translate()` can be used for this purpose (and is probably faster than the solution I gave), the exact given code *removes* punctuation, while the OP wants to *retain* the punctuation in the given set. – Sven Marnach May 31 '12 at 19:13
  • Thanks for the catch. I've adjusted my solution accordingly. You might also be interested in the link to performance tests done in another thread. – abought May 31 '12 at 20:06
1

In Python 3.x, you can use the translate method of strings to remove the characters you do not want:

>>> def remove(string, characters):
...     return string.translate(str.maketrans('', '', characters))
...
>>> import string
>>> # raw string: avoids the invalid-escape warning for \/ on newer Pythons
>>> remove(string.printable, string.ascii_letters + string.digits +
...                          r'''!$%*()_-=+\/.,><:;'"?|''')
'#&@[]^`{}~ \t\n\r\x0b\x0c'
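The output above is exactly the set of printable ASCII characters that should be deleted, so the same helper can be applied a second time to clean a line. A minimal sketch, assuming the input contains only printable ASCII (anything else passes through untouched); the variable names allowed and unwanted are illustrative.

```python
import string

def remove(text, characters):
    """Delete every character in `characters` from `text`."""
    return text.translate(str.maketrans('', '', characters))

allowed = string.ascii_letters + string.digits + r'''!$%*()_-=+\/.,><:;'"?|'''
# Everything printable that is not allowed is what we want to strip.
unwanted = remove(string.printable, allowed)

print(remove("key=value; #comment [x]", unwanted))  # key=value;commentx
```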
Noctis Skytower