I tried several solutions from here, and although they strip some characters, they don't seem to work on words with multiple punctuation characters, e.g. "[ or ', This code:

regex = re.compile('[%s]' % re.escape(string.punctuation))
for i in words:
    while regex.match(i):
        regex.sub('', i)

I got this from "Best way to strip punctuation from a string in Python". It was good, but I still run into problems with double punctuation. I added a while loop hoping to iterate over each word and remove multiple punctuation characters, but that does not seem to work: it just gets stuck on the first item, "[, and never exits.

Am I missing some obvious piece that I am just oblivious to?

I solved the problem by adding redundancy and looping over my lists twice, but this takes an extremely long time (well into the minutes) because the sets are fairly large.

I use Python 2.7

rodling
  • Can you add a sample string where `regex.sub('',???)` doesn't work? (In other words, fill in `???`). – mgilson Sep 11 '12 at 18:23
  • Not sure what you are asking, to be honest. A sample string in place of `i`? – rodling Sep 11 '12 at 18:36
  • Yep, that's what I'm asking for. To me, it looks like the regex/sub you have should work ... a concrete example showing it fail would be helpful. – mgilson Sep 11 '12 at 18:40
  • I added `print i` above `regex.sub('', i)` and it prints `["` forever; I had to Ctrl-C out of it. I checked the punctuation set and both characters are included. I am baffled. – rodling Sep 11 '12 at 19:19
  • So the string it has trouble with is `["` -- because the re you're using has no problem turning `'["'` into `''` (as for the infinite loop, check out the answer by @LukasGraf). – mgilson Sep 11 '12 at 19:21

3 Answers

Your code doesn't work because regex.match only succeeds if the pattern matches at the beginning of the string.

Also, you don't do anything with the return value of regex.sub(). sub doesn't modify the string in place; it returns a new string, so you need to assign its result to something.

regex.search returns a match if the pattern is found anywhere in the string and works as expected:

import re
import string

words = ['a.bc,,', 'cdd,gf.f.d,fe']

regex = re.compile('[%s]' % re.escape(string.punctuation))
for i in words:
    while regex.search(i):
        i = regex.sub('', i)
    print i

Edit: As pointed out below by @senderle, the while clause isn't necessary and can be left out completely.
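
A minimal sketch of the points above, using a sample word chosen purely for illustration: `match` only looks at the start of the string, `search` scans all of it, and `sub` returns a new string with every occurrence removed in a single call.

```python
import re
import string

regex = re.compile('[%s]' % re.escape(string.punctuation))

word = 'a.bc,,'  # sample input, chosen for illustration

# match() only tries the pattern at the very start of the string
starts_with_punct = regex.match(word) is not None
# search() scans the whole string for the pattern
contains_punct = regex.search(word) is not None

# sub() returns a new string; `word` itself is left unchanged
cleaned = regex.sub('', word)

print('match: %s, search: %s' % (starts_with_punct, contains_punct))
print('%r -> %r' % (word, cleaned))
```

Because one `sub` call already removes every punctuation character, no surrounding loop is needed.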

Lukas Graf
  • Oh, right. *Slaps forehead*. But -- in that case, is the `while` clause necessary at all? Why not just call `sub` directly? – senderle Sep 11 '12 at 18:44
  • @senderle Good point, you're absolutely right, the while clause is unnecessary. – Lukas Graf Sep 11 '12 at 18:49
  • Still doesn't work. I added `if regex.search(i)` where your `print i` is, and it still enters that if statement. I don't even know how that's possible. – rodling Sep 11 '12 at 20:45
  • @rodling Did you also read my comment about assigning the return value of `re.sub`? Notice the `i = ...` in my code. – Lukas Graf Sep 11 '12 at 20:52
  • @rodling, then you're additionally doing something wrong in another part of your code we don't get to see. The code I posted works, and if you replace `print i` in that snippet with `if regex.search(i): ...` it will _not_ step into that `if` statement. Can you reduce your problem to a test case that is executable and complete with some data, and demonstrate why it doesn't behave as it should? – Lukas Graf Sep 11 '12 at 21:20
  • @rodling Also, the code as I posted it of course does not change the list `words` (strings are immutable, and rebinding `i` does not touch the list). It writes the result of the substitution to `i`, which gets overwritten every iteration. That's why I `print` it; I wanted to keep the difference to your code as minimal as possible. – Lukas Graf Sep 11 '12 at 21:31
  • @LukasGraf Fixed it! I was writing the variables wrong: `regex = re.compile('[%s]' % re.escape(string.punctuation))` `clean_words = []` `for i in words: if not regex.search(i): clean_words.append(i)`. I used that because I don't need `''` entries in the list. Thanks for the help! – rodling Sep 11 '12 at 21:40
  • @rodling Glad you got it to work :) One last tip though: You don't need the whole `regex.search` business. Simply run `regex.sub` on all your words: `clean_words = [regex.sub('', w) for w in words]` - one line, not counting the regex itself. – Lukas Graf Sep 11 '12 at 21:45
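
To make the one-liner from the last comment concrete, here is a small sketch that also drops words that end up empty, since the asker did not want `''` entries in the list. The sample list and the helper name `strip_words` are hypothetical, not from the thread.

```python
import re
import string

regex = re.compile('[%s]' % re.escape(string.punctuation))

def strip_words(words):
    """Strip punctuation from each word and drop words that become empty."""
    stripped = (regex.sub('', w) for w in words)
    return [w for w in stripped if w]

words = ['["', "it's", 'cat,', '...', 'dog']  # sample data for illustration
clean_words = strip_words(words)
print(clean_words)
```

This makes a single pass over the list, so it should avoid the cost of the double loop described in the question.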

This removes every character that is not an ASCII letter, a digit, or a space:

re.sub("[^a-zA-Z0-9 ]","",my_text)


>>> re.sub("[^a-zA-Z0-9 ]","","A [Black. Cat' On a Hot , tin roof!")
'A Black Cat On a Hot  tin roof'
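
One caveat worth checking before using this on real data: the class `[^a-zA-Z0-9 ]` also removes accented and other non-ASCII letters. The sketch below compares it with `[^\w\s]` plus `re.UNICODE`, which is a swapped-in alternative rather than part of this answer; note that `\w` also keeps underscores. The sample sentence is made up for illustration.

```python
import re

# Sample text with accented letters (unicode literal works in Python 2 and 3)
text = u"caf\u00e9 au lait, s'il vous pla\u00eet!"

# The ASCII whitelist drops the accented letters along with the punctuation
ascii_only = re.sub(u"[^a-zA-Z0-9 ]", u"", text)

# Keeping word characters and whitespace preserves the accented letters
unicode_aware = re.sub(u"[^\\w\\s]", u"", text, flags=re.UNICODE)

print(repr(ascii_only))
print(repr(unicode_aware))
```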
Joran Beasley

Here is a simple way:

>>> print str.translate("My&& Dog's {{{%!@#%!@#$L&&&ove Sal*mon", None,'~`!@#$%^&*()_+=-[]\|}{;:/><,.?\"\'')
My Dogs Love Salmon

In Python 2, calling str.translate with None as the table and a string of characters to delete removes the punctuation. I usually use this to strip numbers from DNA sequence reads.
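
The two-argument call above is specific to Python 2 byte strings. As a hedged sketch, the helper below (the name `strip_punctuation` is hypothetical) uses `string.punctuation` as the delete set and falls back to `str.maketrans` on Python 3, where `translate` takes only a table.

```python
import string
import sys

def strip_punctuation(text):
    """Delete every character in string.punctuation from text."""
    if sys.version_info[0] >= 3:
        # Python 3: build a table whose third argument lists chars to delete
        return text.translate(str.maketrans('', '', string.punctuation))
    # Python 2: pass None as the table and the chars to delete
    return text.translate(None, string.punctuation)

result = strip_punctuation("My&& Dog's {{{%!@#%!@#$L&&&ove Sal*mon")
print(result)
```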

chimpsarehungry