Remove Unicode code (\uxxx) in string Python

Question

I have a string in a document that contains Unicode escape codes. All I want is to remove these codes or replace them with a space (" "). Example:

doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"

How do I convert it to the following?

doc = "Hello my name is Ruth! I really like swimming and dancing"

I already tried this: https://stackoverflow.com/a/20078869/5505608, but nothing happens. I'm using Python 3.

If the answer you linked didn't work, there's something you're not telling us. — Mark Ransom, May 16 '17 at 21:32
i already tried `re.sub(r'[^\x00-\x7F]+',' ', text)`. the code works, but nothing changed @MarkRansom — Fregy, May 17 '17 at 05:38
That's because strings don't update in-place, they're immutable. You need to take the return value of `re.sub` and assign it back to `text`. — Mark Ransom, May 17 '17 at 14:00
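
To illustrate the point about immutability from the comment above, here is a minimal sketch. It reuses the pattern from the comments and the `doc` string from the question; the variable names `original` and `cleaned` are only illustrative.

```python
import re

original = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"

# re.sub returns a new string; calling it without using the result changes nothing.
re.sub(r'[^\x00-\x7F]+', ' ', original)

# Keep the return value to actually get the cleaned text.
cleaned = re.sub(r'[^\x00-\x7F]+', ' ', original)
```

After this runs, `original` still contains the non-ASCII characters, while `cleaned` holds the version with each run of them replaced by a space.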

score 4 · Accepted Answer · answered May 16 '17 at 20:29

You can encode to ASCII and ignore errors (i.e. code points that cannot be converted to an ASCII character).

>>> doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"
>>> doc.encode('ascii', errors='ignore')
b'Hello my name is Ruth ! I really like swimming and dancing '

If the trailing whitespace bothers you, strip it off. Depending on your use case, you can decode the result again with ASCII. Chaining everything would look like this:

>>> doc.encode('ascii', errors='ignore').strip().decode('ascii')
'Hello my name is Ruth ! I really like swimming and dancing'
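
The chained call still leaves a space before the "!", so it does not quite match the output asked for in the question. A possible way to close that gap is sketched below. The helper name `strip_non_ascii` and the punctuation set in the regular expression are assumptions made for illustration, not part of the original answer.

```python
import re

def strip_non_ascii(text):
    """Drop every non-ASCII code point, then tidy the leftover whitespace."""
    ascii_only = text.encode('ascii', errors='ignore').decode('ascii')
    # Collapse runs of whitespace and trim both ends.
    collapsed = ' '.join(ascii_only.split())
    # Remove the space left in front of punctuation, e.g. "Ruth !" -> "Ruth!".
    # The punctuation set is an illustrative choice, adjust as needed.
    return re.sub(r'\s+([!?.,;:])', r'\1', collapsed)

doc = "Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"
# Strings are immutable, so the result has to be assigned back.
doc = strip_non_ascii(doc)
print(doc)
```

Note that this removes every non-ASCII character, including accented letters, which may or may not be what you want for tweets.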

timgeb


I've already tried encoding; the code runs, but still nothing changes. Thanks for your reply. – Fregy May 17 '17 at 05:33
My purpose is to clean the Unicode codes from tweets that I've streamed. I tried the code on my tweet.txt, which contains 10 tweets. – Fregy May 17 '17 at 05:48
which one? @timgeb – Fregy May 17 '17 at 06:15
the one in the answer. – timgeb May 17 '17 at 06:15
the unicode code still appears after using `tweet.encode('ascii', errors='ignore')` – Fregy May 17 '17 at 06:26
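
The thread does not say how tweet.txt was written, so the following is only a guess. Apart from not assigning the result back, one possible explanation is that the file holds literal backslash-u escapes (six plain ASCII characters each) rather than real non-ASCII characters. In that case encoding to ASCII has nothing to drop. A sketch under that assumption, with a raw string standing in for a line read from the file:

```python
import re

# Assumption: the raw string mimics file content where "\u2026" is six ASCII characters.
line = r"Hello my name is Ruth \u2026! I really like swimming and dancing \ud83c"

# Every character is already ASCII, so this round trip changes nothing.
still_escaped = line.encode('ascii', errors='ignore').decode('ascii')

# Remove the literal escape sequences with a regular expression instead.
cleaned = re.sub(r'\\u[0-9a-fA-F]{4}', '', line).strip()
```

If the file really contains decoded characters, the `encode`/`decode` approach from the answer applies, as long as its return value is kept.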
