
I have a string that contains both English and Arabic text, and I need to remove the special characters from it.

I know there is a regex solution:

re.sub('[^A-Za-z0-9]+', '', mystring)

but this regex also removes the Arabic letters from the string.

Mohit Rajpoot
  • By "special characters", do you mean everything non English/Arabic characters such as punctuations, symbols, Japanese characters, etc.? Do you mind providing an input and an expected output? – Taku Apr 18 '18 at 06:31
  • @ChickenFeet "special characters" means only punctuation and symbols. – Mohit Rajpoot Apr 18 '18 at 06:36
  • I reopened this question because OP tried exactly what's mentioned in duplicate question. This is a different question/issue that needs to be addressed in another way. Please mention any true duplicate if you know so that we can close it properly. And, please use your privilege of closing questions properly and don't cause confusion to (future) users. – Mazdak Apr 18 '18 at 06:51
  • For those who are downvoting this issue, I have looked at https://stackoverflow.com/questions/5843518/remove-all-special-characters-punctuation-and-spaces-from-string/5844618#5844618 but there was not any clarification about the answer given to this question and that answer was also for python2. so I asked this question to clarify whether that solution is accepted in python3 or not. – Mohit Rajpoot Apr 18 '18 at 07:00
  • @Kasramvd Please close it back again, the answer is there, in https://stackoverflow.com/a/5844618/3832970. It works in both Python 2 and 3. – Wiktor Stribiżew Apr 18 '18 at 07:25
  • @WiktorStribiżew That's not a valid reason to mark it as a duplicate I found a more closer dup here tho https://stackoverflow.com/questions/11066400/remove-punctuation-from-unicode-formatted-strings/11066687 – Mazdak Apr 18 '18 at 07:38
  • Closed. @Kasramvd ^ the solution in duplicate is cleaner than your answer. – Antti Haapala -- Слава Україні Apr 18 '18 at 07:39
  • @Kasramvd you've been awarded the gold badge. Use it and use it for *good* - *edit* the duplicates to add that question! – Antti Haapala -- Слава Україні Apr 18 '18 at 07:40
  • @AnttiHaapala Clean is an abstract idea and it's relative. I also proposed two separate solutions, the first of which is more optimized than what you call clean. Besides, next time you want to say clean, elaborate on that and show how it is cleaner. Is it cleaner only from your perspective, or do you mean for everybody? And more importantly, how? Also, cleanness is not always a factor; please compare them by other factors such as memory usage, execution runtime, scalability, etc. – Mazdak Apr 18 '18 at 07:55
  • @Kasramvd the solution in the duplicate `'[\W_]+'` uses less substitutions when there are runs of non-word characters with `_`. – Antti Haapala -- Слава Україні Apr 18 '18 at 08:04

1 Answer


If the underscore (_) is not among your special characters, one clean way around this is to negate the word-character class (\w) and pass the Unicode flag. (In Python 3, str patterns match Unicode by default, so the flag is not needed there.)

In [10]: s = "#$&%NKGS&$@023489_7نسیتلبskdjfh3%-"

In [11]: re.sub(r'[^\w]+', '', s, flags=re.U)
Out[11]: 'NKGS023489_7نسیتلبskdjfh3'

If the underscore is one of your special characters, you can remove it as well by adding it to the pattern:

In [12]: re.sub(r'[^\w]+|_', '', s, flags=re.U)
Out[12]: 'NKGS0234897نسیتلبskdjfh3'
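
As a side note, the two calls above can be folded into one small Python 3 function. This is only a sketch: the name `strip_special` and the `keep_underscore` parameter are made up for illustration, and it uses the shorter `[\W_]+` pattern suggested in the comments, which should behave the same as `[^\w]+|_` on this input.

```python
import re


def strip_special(text, keep_underscore=False):
    """Drop every run of characters that are not Unicode letters or digits.

    Hypothetical helper for illustration; the name and the flag are not
    part of the re module.
    """
    # In Python 3, \w is Unicode-aware for str patterns, so Arabic
    # letters count as word characters and re.U is not required.
    pattern = r'\W+' if keep_underscore else r'[\W_]+'
    return re.sub(pattern, '', text)


s = "#$&%NKGS&$@023489_7نسیتلبskdjfh3%-"
print(strip_special(s))                        # NKGS0234897نسیتلبskdjfh3
print(strip_special(s, keep_underscore=True))  # NKGS023489_7نسیتلبskdjfh3
```

Keep in mind that `\W` also matches whitespace, so spaces between words are removed too. If they should survive, a pattern such as `[^\w\s]+` is one option.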
Mazdak
  • `re.sub(r'[\W_]+', '', s)` is enough in Python 3 (`re.U` is implied). Actually, it is the same as https://stackoverflow.com/a/5844618/3832970 – Wiktor Stribiżew Apr 18 '18 at 06:36
  • @WiktorStribiżew Yes. But I still think that explicit is better than implicit, in this case both for the regex format and the unicode flag. I'll still add that as a side note to the answer, thanks for the note. – Mazdak Apr 18 '18 at 06:40