How do I remove \uXXXX from a list of strings?

Question

I want to remove all words that begin with \u. I believe these are Unicode escape sequences of the form '\uXXXX'.

The original string:

"RT  \u2066als \u2066@WBHoekstra\u2069 zijn poot maar stijf houdt in de Italiaanse kwestie. Leest Mattheus 25, 2-13 '"

Desired output:

"RT @WBHoekstra zijn poot maar stijf houdt in de Italiaanse kwestie. Leest Mattheus 25, 2-13 '"

I tried using regex like so:

re.sub('\u\w+','',item )

But I get the following error:

"SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape"

You can't match `\u\w+` as there are no such patterns in your string, that is `RT ⁦als ⁦@WBHoekstra⁩ zijn poot maar stijf houdt in de Italiaanse kwestie. Leest Mattheus 25, 2-13 '`, see https://rextester.com/MDMRR93300. If you need to remove these chars, just use `item=item.replace('\u2066','').replace('\u2069','')` — Wiktor Stribiżew, Apr 29 '20 at 14:26
Does this answer your question? [Easiest way to remove unicode representations from a string in python 3?](https://stackoverflow.com/questions/13793973/easiest-way-to-remove-unicode-representations-from-a-string-in-python-3) — GiftZwergrapper, Apr 29 '20 at 14:28
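To illustrate the first comment above: in a normal (non-raw) Python string literal, `\u2066` is already decoded into a single character, so the text contains no literal backslash-u for a pattern to find. The `SyntaxError` arises because `'\u\w+'` is itself parsed as a string literal with an incomplete `\uXXXX` escape. Below is a minimal sketch that removes the characters directly, assuming the only unwanted characters are the directional isolates U+2066 to U+2069; the variable names and the shortened sample text are illustrative only.

```python
import re

# Shortened sample; each \uXXXX escape becomes ONE invisible character.
text = "RT  \u2066als \u2066@WBHoekstra\u2069 zijn poot maar stijf houdt"

# The escape is a single code point, not six literal characters.
assert len("\u2066") == 1

# Option 1: chained str.replace, as suggested in the comment.
cleaned_replace = text.replace("\u2066", "").replace("\u2069", "")

# Option 2: a regex character class covering U+2066..U+2069
# (assumption: only these directional isolate characters need removing).
BIDI_ISOLATES = re.compile("[\u2066-\u2069]")
cleaned_regex = BIDI_ISOLATES.sub("", text)

print(cleaned_regex)
```

Note that both options delete only the invisible characters. The word "als" stays in the result, so this does not drop whole words as in the desired output shown in the question.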

score 0 · Answer 1 · edited Apr 29 '20 at 14:30


You can do this by using .encode('ascii', 'ignore'), which drops every character that cannot be represented in ASCII:

"RT  \u2066als \u2066@WBHoekstra\u2069 zijn poot maar stijf houdt in de Italiaanse kwestie. Leest Mattheus 25, 2-13 '".encode('ascii', 'ignore')

Output:

 b"RT  als @WBHoekstra zijn poot maar stijf houdt in de Italiaanse kwestie. Leest Mattheus 25, 2-13 '"

edited Apr 29 '20 at 14:30

GiftZwergrapper

answered Apr 29 '20 at 14:27

Beny Gj

This worked, thanks. But is it possible to remove that 'b' at the beginning of the string? I am preprocessing some tweets and I cannot have that 'b' in there. I tried removing it with a regex, but that does not work. – blah Apr 29 '20 at 14:35
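On the 'b' prefix: it is not a character in the text. It is how Python displays a `bytes` object, which is what `.encode()` returns, so a regex cannot strip it. A minimal sketch of one way to get a `str` back is to decode the bytes again; the variable names and the shortened sample text are illustrative only.

```python
# Shortened sample from the question.
text = "RT  \u2066als \u2066@WBHoekstra\u2069 zijn poot maar stijf houdt"

# encode() returns bytes; errors="ignore" silently drops non-ASCII characters.
ascii_bytes = text.encode("ascii", "ignore")

# decode() turns the bytes back into a str, so no b'...' prefix is shown.
ascii_text = ascii_bytes.decode("ascii")

print(type(ascii_bytes).__name__, type(ascii_text).__name__)
print(ascii_text)

# Caveat: every non-ASCII character is dropped, including accented letters.
accent_example = "café".encode("ascii", "ignore").decode("ascii")
```

Because this round trip discards all non-ASCII characters, it may be too aggressive for tweets in languages that use accented letters or emoji; removing only the specific unwanted code points avoids that.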
