
I have the following code:

import re
oldstr="HR Director, LearningÂ"
newstr = re.sub(r"[-()\"#/@;:<>{}`+=&~|.!?,^]", " ", oldstr)
print(newstr)

The above code does not remove the trailing `Â`.

Current result: "HR Director, LearningÂ"

Expected result: "HR Director, Learning"

How can I achieve this?

user10083444
  • 1
    So, why not add `Â` to the character class? Or, better, fix the encoding issue. – Wiktor Stribiżew Apr 02 '20 at 19:25
  • 3
    Use: `re.sub(r'[^\x00-\x7f]+|[-()"#/@;:<>{}\`+=&~|.!?,^]+', "", oldstr)` where `[^\x00-\x7f]` will match all non-ASCII characters – anubhava Apr 02 '20 at 19:26
  • 1
    Does this answer your question? [What is the best way to remove accents in a Python unicode string?](https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string) – XPhyro Apr 02 '20 at 19:27
  • 1
    Thanks @WiktorStribiżew . Your solutions worked. – user10083444 Apr 02 '20 at 19:33
  • 1
    Thanks @anubhava your solution is elegant too. – user10083444 Apr 02 '20 at 19:34
  • 1
    @XPhyro I got my answer here. Thanks for pointing me to the link. The solution mentioned there didn't work for me for some reason. I had a look at it before posting my question. – user10083444 Apr 02 '20 at 19:35
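
The "fix the encoding issue" suggestion from the comments can be sketched as follows. This assumes the stray `Â` comes from UTF-8 bytes being decoded as Latin-1, a common cause when a non-breaking space (U+00A0) is involved; the question does not say where the string came from, so this is only a guess. `repair_mojibake` is a hypothetical helper name.

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8 bytes that were wrongly decoded as Latin-1 (assumed cause)."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake, or part of the byte sequence is missing:
        # leave the text unchanged.
        return text


# Reproduce the assumed corruption: a trailing non-breaking space, encoded
# as UTF-8 and then mis-decoded as Latin-1, becomes "Â" plus U+00A0.
garbled = "HR Director, Learning\u00a0".encode("utf-8").decode("latin-1")

repaired = repair_mojibake(garbled)
# Turn the recovered non-breaking space into a normal space and trim it.
cleaned = repaired.replace("\u00a0", " ").strip()
print(cleaned)
```

If the non-breaking space after the `Â` has already been lost, the round trip fails and the helper returns the input unchanged; in that case filtering the character out, as in the answers, is the fallback.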

2 Answers


Converting my comment to an answer so that the solution is easy to find for future visitors.

You may use:

import re
oldstr="HR Director, LearningÂ"
newstr = re.sub(r'[^\x00-\x7f]+|[-()"#/@;:<>{}`+=&~|.!?,^]+', "", oldstr)
print(newstr)

Output:

HR Director Learning

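`[^\x00-\x7f]` matches any non-ASCII character. The second alternative removes the listed punctuation, which includes the comma, so the comma is stripped from the output as well.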
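
If only the non-ASCII characters should go and the comma should stay, as in the expected result from the question, the first alternative of the pattern can be used on its own. This is a minimal sketch; `strip_non_ascii` is a hypothetical name, not part of the original answer.

```python
import re


def strip_non_ascii(text: str) -> str:
    """Remove every run of characters outside the ASCII range 0x00-0x7F."""
    return re.sub(r"[^\x00-\x7f]+", "", text)


# "\u00c2" is the stray "Â" from the question.
print(strip_non_ascii("HR Director, Learning\u00c2"))  # HR Director, Learning
```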

anubhava

You can also filter the characters directly, without a regex:

def _removeNonAscii(s): 
    return "".join(i for i in s if ord(i)<128)

Here it is applied to the string from the question:

s = "HR Director, LearningÂ"
def _removeNonAscii(s): 
    return "".join(i for i in s if ord(i)<128)

print(_removeNonAscii(s))

Output:

HR Director, Learning
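
The same filtering can be done with the built-in codec machinery instead of a character-by-character loop. This is a sketch of an alternative, not the answer's own method; `remove_non_ascii` is a hypothetical name.

```python
def remove_non_ascii(text: str) -> str:
    """Drop every character that cannot be encoded as ASCII."""
    # errors="ignore" silently skips unencodable characters.
    return text.encode("ascii", errors="ignore").decode("ascii")


# "\u00c2" is the stray "Â" from the question.
print(remove_non_ascii("HR Director, Learning\u00c2"))  # HR Director, Learning
```

Both variants keep the comma, because they only drop characters with a code point of 128 or above.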