I am working with a set of Unicode strings and using the following piece of code (as shown in Remove punctuation from Unicode formatted strings):

import regex

def punc(text):
    return regex.sub(ur"\p{P}+", " ", text)

I would like to go one step further and selectively keep certain punctuation marks. For example, the hyphen (-) should not be removed from the Unicode string. What would be the best way to do that? Thanks in advance! :)


2 Answers

You can negate the \p{P} with \P{P} then put it in a negated character class ([^…]) along with whatever characters you want to keep, like this:

return regex.sub(ur"[^\P{P}-]+", " ", text)

This matches one or more punctuation characters (\p{P}), excluding any character that is also listed inside the character class (here, -).

Remember that - is a special character within a character class: between two other characters it denotes a range. Unless it appears at the start or end of the character class, you will need to escape it (\-).
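As a rough illustration of the hyphen placement rule, here is a minimal sketch that keeps two characters instead of one. It assumes Python 3 (where the ur"..." prefix used above is no longer valid and plain r"..." strings are already Unicode) and the third-party regex module; the choice of the apostrophe as a second character to keep and the name KEEP_PATTERN are only for the example.

```python
import regex  # third-party module (pip install regex); the stdlib re has no \p{...}

# Punctuation, except the apostrophe and the hyphen.
# The hyphen is placed last in the class so it is read literally, not as a range.
KEEP_PATTERN = regex.compile(r"[^\P{P}'-]+")

def punc(text):
    # Each run of unwanted punctuation collapses into a single space.
    return KEEP_PATTERN.sub(" ", text)

print(punc("It's a well-known fact, isn't it?!"))
```

Note that a replaced run next to an existing space leaves two spaces in a row, so you may want to normalise whitespace afterwards.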


Another solution would be to use a negative lookahead ((?!…)) or a negative lookbehind ((?<!…)):

return regex.sub(ur"((?!-)\p{P})+", " ", text)

return regex.sub(ur"(\p{P}(?<!-))+", " ", text)

But for something like this I'd recommend the character class instead.

p.s.w.g

You can use a character class for this:

def punc(text):
    return regex.sub(ur"[^\P{P}-]+", " ", text)

The trick is to negate the character class ([^a] matches anything except a) and to use the negated Unicode properties:

  • We replace \p{P} with [^\P{P}]; both behave exactly the same.
  • Now we can add characters to the class that should not be matched: [^\P{P}-] matches any punctuation character except -.
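If the third-party regex module is not available, the same "all punctuation except a keep list" idea can be approximated with the standard library alone. This is a different technique from the character class above, not a drop-in equivalent: it relies on unicodedata.category, whose punctuation categories (Pc, Pd, Ps, Pe, Pi, Pf, Po) all start with "P". The function name punc_keep and its keep parameter are made up for this sketch, and results can differ slightly from regex if the two use different Unicode versions.

```python
import unicodedata

def punc_keep(text, keep="-"):
    """Replace each run of punctuation with one space, except characters in `keep`."""
    out = []
    in_run = False  # True while we are inside a run of removed punctuation
    for ch in text:
        is_punct = unicodedata.category(ch).startswith("P") and ch not in keep
        if is_punct:
            if not in_run:
                out.append(" ")  # emit one space per run, like the "+" quantifier
            in_run = True
        else:
            out.append(ch)
            in_run = False
    return "".join(out)

print(punc_keep("well-known, right?!"))
```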
Tim Pietzcker