Remove selected punctuation from unicode strings

Question

I am working with a set of Unicode strings and using the following code (from "Remove punctuation from Unicode formatted strings"):

import regex

def punc(text):
    return regex.sub(ur"\p{P}+", " ", text)

I want to go one step further and selectively keep certain punctuation characters. For example, the hyphen (-) should not be removed from the Unicode string. What would be the best way to do that? Thanks in advance! :)

p.s.w.g · Accepted Answer · 2014-07-08T16:46:17.240

You can negate the \p{P} with \P{P} then put it in a negated character class ([^…]) along with whatever characters you want to keep, like this:

return regex.sub(ur"[^\P{P}-]+", " ", text)

This will match one or more of any character in \p{P} except those that are also defined inside the character class.

Remember that - is a special character within a character class, where it normally denotes a range. If it doesn't appear at the start or end of the character class, you'll probably need to escape it (\-).
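To illustrate the point about hyphen placement, here is a small sketch. It uses the standard-library re module only so that it runs without extra installs; the regex module is expected to treat ranges the same way, but that is an assumption worth checking. The sample string is made up.

```python
import re

sample = "abc-"

# Unescaped hyphen between two characters: read as the range a..c.
as_range = re.findall(r"[a-c]", sample)

# Escaped hyphen: a literal '-', so the class is {a, -, c}.
escaped = re.findall(r"[a\-c]", sample)

# Hyphen at the end of the class: also literal.
at_end = re.findall(r"[ac-]", sample)

print(as_range)  # ['a', 'b', 'c']
print(escaped)   # ['a', 'c', '-']
print(at_end)    # ['a', 'c', '-']
```

In [^\P{P}-] the hyphen is the last character of the class, which is why it can stay unescaped there.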

Another solution would be to use a negative lookahead ((?!…)) or a negative lookbehind ((?<!…)):

return regex.sub(ur"((?!-)\p{P})+", " ", text)

return regex.sub(ur"(\p{P}(?<!-))+", " ", text)

But for something like this I'd recommend the character class instead.
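For comparison, the same filtering can be approximated without the third-party regex module. The sketch below is not from either answer: it swaps the regular expression for a loop over unicodedata.category, whose categories starting with "P" should correspond to \p{P} (the two libraries may ship different Unicode versions, so edge cases can differ). It targets Python 3, and the name punc_keep and its keep parameter are hypothetical.

```python
import unicodedata


def punc_keep(text, keep="-"):
    """Replace each run of punctuation with one space, except characters in `keep`."""
    out = []
    in_run = False  # True while inside a run of removed punctuation
    for ch in text:
        is_punct = unicodedata.category(ch).startswith("P")
        if is_punct and ch not in keep:
            # Emit a single space per run, mirroring the "+" in the regex.
            if not in_run:
                out.append(" ")
            in_run = True
        else:
            out.append(ch)
            in_run = False
    return "".join(out)


print(repr(punc_keep("well-known, right?!")))  # 'well-known  right '
```

Note the double space in the output: the comma becomes a space and the original space after it is kept, which is also what the regex versions above do.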

score 1 · Answer 2 · answered Jul 08 '14 at 16:36

You can use a character class for this:

def punc(text):
    return regex.sub(ur"[^\P{P}-]+", " ", text)

The trick is to negate the character class ([^a] matches anything except a) and to use the negated Unicode properties:

We replace \p{P} with [^\P{P}]; both match exactly the same characters.
Now we can add characters to the class that should not be matched: [^\P{P}-] matches any punctuation character except -.
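The double-negation trick is not specific to \p{P}. As a rough analogy using only the standard-library re module (this is not the answer's code), [^\W_] reads as "not a non-word character and not an underscore", i.e. any word character except _. The sample string is made up.

```python
import re

sample = "snake_case and kebab-case"

# \w+ treats the underscore as a word character.
with_underscore = re.findall(r"\w+", sample)

# [^\W_] = "not a non-word character" and "not underscore": \w minus '_'.
without_underscore = re.findall(r"[^\W_]+", sample)

print(with_underscore)     # ['snake_case', 'and', 'kebab', 'case']
print(without_underscore)  # ['snake', 'case', 'and', 'kebab', 'case']
```

[^\P{P}-] follows the same pattern, with \P{P} in place of \W and - in place of _.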
