29

I have this code for removing all punctuation from a string using the regex module:

import regex as re
# Python 2 syntax; in Python 3 drop the u prefix and write r"\p{P}+"
re.sub(ur"\p{P}+", "", txt)

How would I change it to keep hyphens? If you could explain how you did it, that would be great. As I understand it (correct me if I'm wrong), \p{P} covers every Unicode category whose name starts with P, i.e. all punctuation.

John
  • 4
    @Jerry - I looked a little, and found this: http://stackoverflow.com/a/4316097/7586 - This is `regex`, not `re`. I guess they have two. – Kobi Jan 18 '14 at 20:14
  • @Kobi Oh... I guess that explains it. – Jerry Jan 18 '14 at 20:16

4 Answers

28
[^\P{P}-]+

\P{P} is the complement of \p{P}: it matches anything that is not punctuation. The character class [^\P{P}-] therefore matches any character that is neither non-punctuation nor a dash, which leaves all punctuation except dashes.

Example: http://www.rubular.com/r/JsdNM3nFJ3
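
The double negative can be easier to follow when spelled out in plain Python. The sketch below is not the regex itself: it applies the same rule with the standard unicodedata module, and strip_punctuation and its keep parameter are hypothetical names chosen for illustration.

```python
import unicodedata


def strip_punctuation(text, keep="-"):
    """Drop every Unicode punctuation character except those listed in keep."""
    kept_chars = []
    for char in text:
        # General categories starting with "P" (Pd, Po, Ps, ...) are what \p{P} matches.
        is_punctuation = unicodedata.category(char).startswith("P")
        # Mirror of [^\P{P}-]: remove a character only if it is punctuation
        # and is not one of the characters we want to keep.
        if is_punctuation and char not in keep:
            continue
        kept_chars.append(char)
    return "".join(kept_chars)


print(strip_punctuation("Hello, world! It's a well-known fact."))
```

Characters such as ^, $ and + are classified as symbols (categories starting with S), so neither this function nor \p{P} removes them.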

If you want a less convoluted way, an alternative is \p{P}(?<!-): match any punctuation character, then check that it wasn't a dash (using a negative lookbehind).
Working example: http://www.rubular.com/r/5G62iSYTdk
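
The match-then-veto idea also works in the standard re module, which has no \p{P}. As a rough stand-in, this sketch uses [^\w\s] (anything that is not a word character or whitespace). That substitution is an assumption: unlike \p{P}, it also matches symbols such as ^ and it does not match _.

```python
import re

# Match one candidate character, then look back at it and reject hyphens.
NOT_HYPHEN = re.compile(r"[^\w\s](?<!-)")
# Same idea with several protected characters (hyphen and full stop).
NOT_HYPHEN_OR_DOT = re.compile(r"[^\w\s](?<![\-.])")

sample = "well-known, isn't it? Yes."
without_punct = NOT_HYPHEN.sub("", sample)
without_punct_keep_dot = NOT_HYPHEN_OR_DOT.sub("", sample)
print(without_punct)
print(without_punct_keep_dot)
```

The lookbehind runs after the character has been consumed, so it inspects the character that was just matched rather than the one before it.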

Ravindra S
Kobi
  • 1
    Great, thanks. What about excluding multiple? Such as '.' as well. – John Jan 18 '14 at 20:09
  • 1
    @Anonymous - The first one would be `[^\P{P}\-.]+`, and the second `\p{P}(?<![\-.])`. Pretty straightforward. – Kobi Jan 18 '14 at 20:13
  • Why was it necessary to have '\' now after {P} and not in the first? – John Jan 18 '14 at 20:15
  • @Anonymous - Good question! It isn't strictly *necessary* - I usually prefer to include it. `-` has a special meaning in a character class, like in `[a-z]` - it indicates a range. I usually like to escape it, to avoid potential bugs. – Kobi Jan 18 '14 at 20:17
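
A small check of how the hyphen behaves inside a character class, using the standard re module (the sample string is made up for illustration):

```python
import re

sample = "a-m-z"

# Unescaped between two characters, "-" defines a range: a through z.
as_range = re.findall(r"[a-z]", sample)
# Escaped, it is a literal hyphen: the class matches "a", "-" or "z".
escaped = re.findall(r"[a\-z]", sample)
# Placed last in the class, it is also literal without an escape.
trailing = re.findall(r"[az-]", sample)

print(as_range)
print(escaped)
print(trailing)
```

Escaping the hyphen keeps the pattern correct even if someone later appends more characters to the class.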
25

Here's how to do it with the re module, in case you have to stick with the standard libraries:

# works in python 2 and 3
import re
import string

remove = string.punctuation
remove = remove.replace("-", "") # don't remove hyphens
pattern = r"[{}]".format(re.escape(remove)) # create the pattern; escaping keeps ], ^ and \ literal

txt = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."
re.sub(pattern, "", txt) 
# >>> 'this - is - a - test'

If performance matters, you may want to use str.translate, since it's faster than using a regex. In Python 3, the code is txt.translate({ord(char): None for char in remove}).
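
A minimal Python 3 sketch of the str.translate variant, reusing the sample text from above. No timing is measured here, so the speed claim is not demonstrated by this snippet.

```python
import string

# Punctuation to delete: ASCII punctuation minus the hyphen.
remove = string.punctuation.replace("-", "")
# Mapping a code point to None tells str.translate to delete that character.
table = {ord(char): None for char in remove}

txt = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."
cleaned = txt.translate(table)
print(cleaned)
```

str.maketrans("", "", remove) builds an equivalent table if you prefer not to write the dict comprehension.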

Community
Galen Long
0

You could either specify the punctuation you want to remove manually, as in [._,], or supply a function in place of the replacement string (\p{P} requires the regex module):

re.sub(r"\p{P}", lambda m: "-" if m.group(0) == "-" else "", text)
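
The same callable-replacement approach can be sketched with only the standard library. This version assumes ASCII punctuation from string.punctuation is enough, which is narrower than \p{P}; PUNCT and keep_hyphen are illustrative names.

```python
import re
import string

# ASCII punctuation only; re.escape keeps characters like ] and \ literal.
PUNCT = re.compile("[{}]".format(re.escape(string.punctuation)))


def keep_hyphen(match):
    # Return the hyphen unchanged; replace anything else with nothing.
    return "-" if match.group(0) == "-" else ""


text = "well-known, isn't it?"
result = PUNCT.sub(keep_hyphen, text)
print(result)
```

re.sub calls the function once per match and uses its return value as the replacement.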
Cu3PO42
0

You could try

import re, string

text = ")*^%{}[]thi's - is - @@#!a !%%!!%- test."

# re.escape keeps characters such as ], ^ and \ literal inside the class
exclusion_pattern = r"([{}])".format(re.escape(string.punctuation.replace("-", "")))

result = re.sub(exclusion_pattern, r"", text)

print(result)

# Output (hyphens are kept): this - is - a - test

jeffasante