
I want to remove all words that begin with \u. I believe these are Unicode escapes of the form '\uXXXX'.

The original string:

"RT  \u2066als \u2066@WBHoekstra\u2069 zijn poot maar stijf houdt in de Italiaanse kwestie. Leest Mattheus 25, 2-13 '"

Desired output:

"RT @WBHoekstra zijn poot maar stijf houdt in de Italiaanse kwestie. Leest Mattheus 25, 2-13 '"

I tried using regex like so:

re.sub('\u\w+','',item )

But I get the following error:

"SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 0-1: truncated \uXXXX escape"
blah
    You can't match `\u\w+` as there are no such patterns in your string, that is `RT ⁦als ⁦@WBHoekstra⁩ zijn poot maar stijf houdt in de Italiaanse kwestie. Leest Mattheus 25, 2-13 '`, see https://rextester.com/MDMRR93300. If you need to remove these chars, just use `item=item.replace('\u2066','').replace('\u2069','')` – Wiktor Stribiżew Apr 29 '20 at 14:26
    Does this answer your question? [Easiest way to remove unicode representations from a string in python 3?](https://stackoverflow.com/questions/13793973/easiest-way-to-remove-unicode-representations-from-a-string-in-python-3) – GiftZwergrapper Apr 29 '20 at 14:28
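
To make the first comment concrete: the string holds the actual characters U+2066 and U+2069 (Unicode directional isolate controls), not a literal backslash followed by `u`, so there is no `\u...` text to match. The original `'\u\w+'` fails before the regex even runs, because in a non-raw string literal `\u` must be followed by four hex digits. A minimal sketch, assuming only the isolate controls U+2066 to U+2069 need to go (the variable names `item` and `cleaned` are illustrative):

```python
import re

item = "RT  \u2066als \u2066@WBHoekstra\u2069 zijn poot maar stijf houdt in de Italiaanse kwestie. Leest Mattheus 25, 2-13 '"

# U+2066..U+2069 are the directional isolate controls (LRI, RLI, FSI, PDI).
# The pattern is a raw string: the re module itself understands \uXXXX escapes,
# so the character class matches the real code points in the text.
cleaned = re.sub(r"[\u2066-\u2069]", "", item)

print(cleaned)
```

This removes only the invisible control characters and leaves every other character, including any accented letters, untouched. It does not drop the word `als`; that would need a separate rule.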

1 Answer


You can do this with `.encode('ascii', 'ignore')`, which drops every character that cannot be represented in ASCII and returns a `bytes` object:

"RT  \u2066als \u2066@WBHoekstra\u2069 zijn poot maar stijf houdt in de Italiaanse kwestie. Leest Mattheus 25, 2-13 '".encode('ascii', 'ignore')

Output:

 b"RT  als @WBHoekstra zijn poot maar stijf houdt in de Italiaanse kwestie. Leest Mattheus 25, 2-13 '"
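
The `b` prefix in that output only indicates that `encode()` returned `bytes`; it is not part of the text. A minimal sketch of getting a plain `str` back, plus a hedged alternative for text where accented letters should survive (the helper name `strip_format_chars` is hypothetical, and it assumes the unwanted characters all fall in Unicode general category `Cf`, as U+2066 to U+2069 do):

```python
import unicodedata

text = "RT  \u2066als \u2066@WBHoekstra\u2069 zijn poot maar stijf houdt in de Italiaanse kwestie. Leest Mattheus 25, 2-13 '"

# encode() yields bytes (hence the b"..." repr); decode() turns them back into str.
ascii_only = text.encode("ascii", "ignore").decode("ascii")
print(ascii_only)


def strip_format_chars(s):
    """Drop only Unicode 'format' characters (category Cf), keeping all other text."""
    return "".join(ch for ch in s if unicodedata.category(ch) != "Cf")


# Unlike the ASCII round trip, this keeps letters such as 'é'.
print(strip_format_chars("caf\u00e9 \u2066@user\u2069"))
```

Note that the ASCII round trip silently deletes every non-ASCII character, so a word like `café` becomes `caf`; the category-based filter avoids that at the cost of leaving other non-ASCII symbols in place.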
GiftZwergrapper
Beny Gj
    This worked, thanks. But is there a way to remove that 'b' at the beginning of the string? I am preprocessing some tweets and cannot have that 'b' in there. I tried removing it with a regex, but that did not work. – blah Apr 29 '20 at 14:35