
I tried the following code to remove punctuation from a string.

import re
s = "string. With. Punctuation?"
s = re.sub(r'[^\w\s]','',s)

This works fine for text in Roman (Latin) script, but it has problems with text in other Unicode scripts such as Hindi and Telugu.

For example:

import re
s = "అనేది దేనికి సమానం అవుతుంది."
s = re.sub(r'[^\w\s]','',s)

This changes the text itself and makes it unreadable, because it also removes the dependent vowel signs of the script.
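A minimal sketch of why this happens, assuming Python 3, where `\w` in `re` matches only characters that `str.isalnum()` accepts plus the underscore. Combining marks (Unicode general categories starting with `M`) are not alphanumeric, so `[^\w\s]` treats them like punctuation. The helper name `removed_by_pattern` is hypothetical and only used for illustration.

```python
import re
import unicodedata


def removed_by_pattern(text):
    """Return (character, Unicode category) pairs that [^\\w\\s] would strip."""
    return [(ch, unicodedata.category(ch)) for ch in re.findall(r'[^\w\s]', text)]


s = "అనేది దేనికి సమానం అవుతుంది."
for ch, category in removed_by_pattern(s):
    # Categories starting with "M" are combining marks (vowel signs etc.);
    # categories starting with "P" are actual punctuation.
    print(f"U+{ord(ch):04X} {category} {unicodedata.name(ch, 'UNKNOWN')}")
```

The printed list should contain the trailing full stop together with the Telugu vowel signs, which shows that the pattern removes more than punctuation.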

So my question is: how can I remove punctuation from text that is not in Roman script?

The linked duplicate question removes punctuation from Roman-script strings, which, as I already mentioned, works. My issue is removing punctuation from strings in other Unicode scripts. There is a clear difference, so this is not a duplicate.

Nagaraju
  • You do not have to rely on `[^\w\s]`, which matches any character other than word and whitespace characters; there are other ways. See [this answer](https://stackoverflow.com/a/18570395/3832970) in the linked thread – Wiktor Stribiżew Oct 31 '19 at 10:59
  • @Wiktor Stribiżew, but how is this question an exact duplicate of that one? – Nagaraju Oct 31 '19 at 11:00
  • Because it has the answer with a solution you may use. – Wiktor Stribiżew Oct 31 '19 at 11:14
  • What about [this](https://stackoverflow.com/a/7268456/3270037) answer, or [this](https://stackoverflow.com/a/39901522/3270037) answer? Did you search for unicode answers on that question? – Nick is tired Oct 31 '19 at 11:15
  • Okay, maybe those answers work, but this question is not a duplicate, right? – Nagaraju Oct 31 '19 at 11:17
  • Of course it is, it's the same question and has answers that answer your question. Remember there's *nothing wrong* with a question being a duplicate, it's a signpost for future users that their answer might be somewhere else – Nick is tired Oct 31 '19 at 11:18
  • There is another [relevant thread](https://stackoverflow.com/questions/33787354/strip-special-characters-and-punctuation-from-a-unicode-string). If you can use the PyPI `regex` module, use the `r'[\p{P}\p{S}]+'` or `r'\p{P}+'` pattern. – Wiktor Stribiżew Oct 31 '19 at 11:21
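
The `\p{P}` idea from the last comment can be approximated with the standard library alone. The following is a sketch, not a definitive implementation: it assumes "punctuation" means characters whose Unicode general category starts with `P`, and the function name `strip_punctuation` and its `prefixes` parameter are hypothetical names chosen for this example.

```python
import unicodedata


def strip_punctuation(text, prefixes=("P",)):
    """Drop characters whose Unicode general category starts with one of `prefixes`.

    The default ("P",) removes punctuation only, roughly like \\p{P}.
    Passing ("P", "S") also removes symbols, roughly like [\\p{P}\\p{S}].
    """
    return "".join(
        ch for ch in text
        if not unicodedata.category(ch).startswith(prefixes)
    )


print(strip_punctuation("string. With. Punctuation?"))
print(strip_punctuation("అనేది దేనికి సమానం అవుతుంది."))
```

Because only category `P` characters are filtered out, combining marks such as dependent vowel signs stay in place, so the Telugu sentence should lose only its trailing full stop.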

0 Answers