Unable to remove accented special characters in a string despite using regex

Question

I have the following code:

import re
oldstr="HRÂ Director,Â LearningÂ"
newstr = re.sub(r"[-()\"#/@;:<>{}`+=&~|.!?,^]", " ", oldstr)
print(newstr)

The above code does not work.

Current result "HRÂ Director,Â LearningÂ"

Expected result "HR Director, Learning"

How can I achieve this?

So, why not add `Â` to the character class? Or, better, fix the encoding issue. — Wiktor Stribiżew, Apr 02 '20 at 19:25
Use: `re.sub(r'[^\x00-\x7f]+|[-()"#/@;:<>{}\`+=&~|.!?,^]+', "", oldstr)` where `[^\x00-\x7f]` will match all non-ASCII characters — anubhava, Apr 02 '20 at 19:26
Does this answer your question? [What is the best way to remove accents in a Python unicode string?](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string) — XPhyro, Apr 02 '20 at 19:27
@XPhyro I got my answer here. Thanks for pointing me to the link. The solution mentioned there didn't work for me for some reason. I had a look at it before posting my question. — user10083444, Apr 02 '20 at 19:35

score 1 · Accepted Answer · answered Apr 02 '20 at 19:37

Converting my comment to an answer so that the solution is easy to find for future visitors.

You may use:

import re
oldstr="HRÂ Director,Â LearningÂ"
newstr = re.sub(r'[^\x00-\x7f]+|[-()"#/@;:<>{}`+=&~|.!?,^]+', "", oldstr)
print(newstr)

Output:

HR Director Learning

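`[^\x00-\x7f]` matches any non-ASCII character. The second alternative removes the listed punctuation as well, which is why the comma is also missing from the output.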
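If the comma should stay, as in the expected result from the question, a minimal sketch is to keep only the non-ASCII part of the pattern. It assumes the spaces in the sample string are ordinary ASCII spaces; the string is written with `\u00c2` escapes (the stray `Â`) so the example does not depend on the file encoding.

```python
import re

# \u00c2 is the stray "Â" from the question; spaces are assumed to be plain ASCII.
oldstr = "HR\u00c2 Director,\u00c2 Learning\u00c2"

# Remove only runs of non-ASCII characters; ASCII punctuation such as "," is kept.
newstr = re.sub(r"[^\x00-\x7f]+", "", oldstr)
print(newstr)
```

Under that assumption this prints `HR Director, Learning`.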

score 0 · Answer 2 · answered Apr 02 '20 at 20:04

You can use this method too:

def _removeNonAscii(s): 
    return "".join(i for i in s if ord(i)<128)

Here's how my piece of code outputs:

s = "HRÂ Director,Â LearningÂ"
def _removeNonAscii(s): 
    return "".join(i for i in s if ord(i)<128)

print(_removeNonAscii(s))

Output:

HR Director, Learning
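The same filtering can probably be done without an explicit loop by encoding to ASCII and ignoring characters that cannot be represented. This is only a sketch: `remove_non_ascii` is a hypothetical helper name, and the sample string again uses `\u00c2` escapes and assumes ordinary ASCII spaces.

```python
def remove_non_ascii(s):
    # errors="ignore" silently drops every character ASCII cannot encode.
    return s.encode("ascii", errors="ignore").decode("ascii")

s = "HR\u00c2 Director,\u00c2 Learning\u00c2"
result = remove_non_ascii(s)
print(result)
```

For this input the result should match the generator-based version above, since both keep exactly the characters with a code point below 128.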
