
I have a Unicode string in a document. All I want is to remove the non-ASCII characters or replace them with a space (" "). Example:

doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"

How do I convert it to the following?

doc = "Hello my name is Ruth! I really like swimming and dancing"

I already tried this: https://stackoverflow.com/a/20078869/5505608, but nothing happens. I'm using Python 3.

Fregy
  • If the answer you linked didn't work, there's something you're not telling us. – Mark Ransom May 16 '17 at 21:32
  • I already tried `re.sub(r'[^\x00-\x7F]+',' ', text)`. The code runs, but nothing changed @MarkRansom – Fregy May 17 '17 at 05:38
  • That's because strings don't update in-place, they're immutable. You need to take the return value of `re.sub` and assign it back to `text`. – Mark Ransom May 17 '17 at 14:00
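
To illustrate the point made in the last comment, here is a minimal sketch. It reuses the sample string from the question and the variable name `text` from the comment; the name `unchanged` is only there for illustration. It shows that `re.sub` returns a new string and leaves the original untouched, so the result has to be assigned back:

```python
import re

text = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"

# re.sub returns a new string; calling it without using the
# return value has no visible effect on `text`.
re.sub(r'[^\x00-\x7F]+', ' ', text)
unchanged = text  # still contains the non-ASCII characters

# Assign the return value back to keep the result.
text = re.sub(r'[^\x00-\x7F]+', ' ', text)
print(text)
```

Note that this variant replaces each run of non-ASCII characters with a space, so extra spaces remain where the characters used to be.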

1 Answer


You can encode to ASCII and ignore errors, i.e. silently drop every code point that cannot be encoded as an ASCII character.

>>> doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"
>>> doc.encode('ascii', errors='ignore')
b'Hello my name is Ruth ! I really like swimming and dancing '

If the trailing whitespace bothers you, strip it off. Depending on your use case, you can decode the result again with ASCII. Chaining everything would look like this:

>>> doc.encode('ascii', errors='ignore').strip().decode('ascii')
'Hello my name is Ruth ! I really like swimming and dancing'
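
The result above still has a space before the "!", while the output asked for in the question does not. One possible way to close that gap is sketched below. The helper name `strip_non_ascii` and the punctuation rule are assumptions made for illustration, not part of the original answer, and the rule would also change text where a space before punctuation is intentional.

```python
import re

def strip_non_ascii(text):
    """Drop non-ASCII code points, then tidy the leftover whitespace.

    Hypothetical helper: the cleanup rules below are assumptions.
    """
    ascii_only = text.encode('ascii', errors='ignore').decode('ascii')
    # Collapse runs of whitespace left behind by removed characters
    # and trim both ends.
    collapsed = re.sub(r'\s+', ' ', ascii_only).strip()
    # Remove whitespace that now sits directly before punctuation.
    return re.sub(r'\s+([!?.,;:])', r'\1', collapsed)

doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"
cleaned = strip_non_ascii(doc)
print(cleaned)
```

With the sample string from the question, `cleaned` is `'Hello my name is Ruth! I really like swimming and dancing'`.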
timgeb