I am working with a set of Unicode strings and using the following piece of code (as shown in "Remove punctuation from Unicode formatted strings"):
    import regex

    def punc(text):
        # Replace each run of Unicode punctuation with a single space.
        return regex.sub(ur"\p{P}+", " ", text)
I want to go one step further and selectively keep certain punctuation marks. For example, the hyphen (-) should not be removed from the Unicode string. What would be the best way to do that? Thanks in advance! :)
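
To make the desired behaviour concrete, here is a minimal sketch of what I am after. It is only an illustration, and it swaps the regex module for the standard library's unicodedata so that it runs without extra dependencies. The names KEEP, is_removable and punc_keep are hypothetical, not part of any library. With the regex module itself, a negated class such as [^\P{P}-]+ might express the same idea ("punctuation except -"), but I have not settled on that.

```python
import unicodedata
from itertools import groupby

# Hypothetical whitelist of punctuation characters that should survive.
KEEP = {u"-"}

def is_removable(ch, keep=KEEP):
    # Unicode general categories starting with "P" are punctuation,
    # which mirrors what \p{P} matches in the regex module.
    return unicodedata.category(ch).startswith("P") and ch not in keep

def punc_keep(text, keep=KEEP):
    # Replace each run of removable punctuation with a single space,
    # leaving whitelisted characters and all other text untouched.
    parts = []
    for removable, group in groupby(text, key=lambda ch: is_removable(ch, keep)):
        parts.append(u" " if removable else u"".join(group))
    return u"".join(parts)
```

For instance, punc_keep(u"a-b, c!!") keeps the hyphen but turns the comma and the run of exclamation marks into single spaces.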