
I found several questions on this topic, along with this solution:

sentence=re.sub(ur"[^\P{P}'|-]+",'',sentence)

This should remove all punctuation except ', but the problem is that it also strips everything else from the sentence.

Example:

>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> sentence=re.sub(ur"[^\P{P}']+",'',sentence)
>>> print sentence
'

Of course, what I want is to keep the sentence without its punctuation, with "warhol's" left as is.

Desired output:

"warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music"
"austro-hungarian empire"

Edit: I also tried using

import sys
import unicodedata

tbl = dict.fromkeys(i for i in xrange(sys.maxunicode)
    if unicodedata.category(unichr(i)).startswith('P'))
sentence = sentence.translate(tbl)

but this strips all punctuation, including the apostrophe.
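For reference, a minimal sketch of how a translation table like the one above could skip the characters that should survive. It is written for Python 3 (`range` and `chr` instead of `xrange` and `unichr`); the `KEEP` set and the function name `strip_punctuation_translate` are assumptions made for illustration, not part of the original attempt.

```python
import sys
import unicodedata

# Assumed set of punctuation characters to preserve.
KEEP = {"'", "-"}

# Map every code point in a Unicode punctuation category (P*) to None,
# which makes str.translate delete it, except for the characters in KEEP.
PUNCT_TABLE = {
    cp: None
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith("P") and chr(cp) not in KEEP
}


def strip_punctuation_translate(sentence):
    """Delete all Unicode punctuation from sentence except the KEEP characters."""
    return sentence.translate(PUNCT_TABLE)


print(strip_punctuation_translate("warhol's art used many types of media, including film, and music."))
print(strip_punctuation_translate("austro-hungarian empire."))
```

Building the table walks the whole Unicode range once, so it is worth creating it a single time and reusing it. Symbols such as `$` or `+` belong to the S categories rather than P, so this table leaves them in place.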

Martijn Pieters
KameeCoding
  • [here](http://stackoverflow.com/questions/21209024/python-regex-remove-all-punctuation-except-hyphen-for-unicode-string) it says it should remove everything that is punctuation except ' – KameeCoding Apr 28 '15 at 21:45
  • Oops, you are correct; not that versed in the new `regex` module constructs. – Martijn Pieters Apr 28 '15 at 21:48

1 Answer


Specify all the characters you don't want removed, e.g. \w, \d, \s, etc. Inside square brackets, a leading ^ negates the character class, so the pattern matches anything except the characters listed.

>>> import re
>>> sentence="warhol's art used many types of media, including hand drawing, painting, printmaking, photography, silk screening, sculpture, film, and music."
>>> print re.sub(ur"[^\w\d'\s]+",'',sentence)
warhol's art used many types of media including hand drawing painting printmaking photography silk screening sculpture film and music
>>> 
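The same negated-class idea can be sketched for Python 3 with the hyphen kept as well, so that the "austro-hungarian empire" case from the question also passes. The `KEEP` string and the function name `strip_punctuation` are assumptions made for this illustration. In Python 3, `str` patterns are Unicode-aware by default, and `\d` is already covered by `\w`, so it is left out here.

```python
import re

# Assumed set of extra characters to keep besides word characters and whitespace.
KEEP = "'-"

# Negated character class: match runs of anything that is NOT a word
# character, whitespace, or one of the KEEP characters.
# re.escape makes sure the hyphen is not read as a range inside the class.
_PUNCT_RE = re.compile(r"[^\w\s" + re.escape(KEEP) + r"]+")


def strip_punctuation(sentence):
    """Remove every character that is not a word character, whitespace, or in KEEP."""
    return _PUNCT_RE.sub("", sentence)


print(strip_punctuation("warhol's art used many types of media, including film, and music."))
print(strip_punctuation("austro-hungarian empire."))
```

Note that `\w` also matches the underscore, so `_` survives this pattern; if that matters, it would have to be removed in a separate step.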
C.B.