
I'm scraping some data from a website, and using regex I was able to extract some strings in UTF-16 format. Using this site I'm able to decode the strings I extract, but I want to do it all in Python.
The extracted text is a string, not bytes, so a simple .encode() doesn't work.
For example:
String: \u0074\u0065\u0073\u0074 --> String: test
I can think of solving this by treating the string as a bytes object, but I have no idea how to do this.

EDIT: The data chunk I extracted using regex:

I = new Array();

I[0] = new Array();
I[0][1] = new Array();
I[0][1][0] = new Array();
I[0][1][0][0] = '\u0074\u0065\u0073\u0074';
I[0][2]='';

Any help is appreciated.
Thanks

Xosrov
  • I couldn't find any duplicates of this question, so if you find one, kindly link it here. Thanks – Xosrov Feb 26 '20 at 12:03
  • Unicode is readable text. The page you read right now is Unicode (UTF-8 specifically), which is why I can write Αυτό Εδώ and know it will appear without any issues. What you posted is escape sequences – Panagiotis Kanavos Feb 26 '20 at 12:09
  • Why do you assume there's any kind of encoding or decoding involved? Where did you see those escape sequences, how was that string constructed? It could be that the escape sequences are just the way the debugger displays those characters – Panagiotis Kanavos Feb 26 '20 at 12:11
  • @PanagiotisKanavos It's constructed with JavaScript in the webpage. I'm not too familiar with JS, but I'll still edit the post with more info – Xosrov Feb 26 '20 at 12:14
  • Javascript strings are UTF16, just like Java and C# (which is used in SO). No encoding is needed. What you posted is just escape sequences. In fact, if I just type `'\u0074\u0065\u0073\u0074'` in Python3 I get `test` in the console. If I type `'\u0074\u0065\u0073\u0074'=='test'` I get `True` – Panagiotis Kanavos Feb 26 '20 at 13:03
  • The duplicates are unrelated. The string *doesn't* contain double backslashes at all. There's nothing to convert or decode – Panagiotis Kanavos Feb 26 '20 at 13:05
  • @Xosrov if what you posted is the *actual* string, there's nothing to encode or decode - whatever tool you use is showing the escape sequences instead of the actual characters. You'd have a problem only if you saw double backslashes. This would mean that the character was replaced by the characters in the escape sequence itself, e.g. instead of `t`, the string contained 6 characters, `\ `, `u`, etc., with the backslash escaped – Panagiotis Kanavos Feb 26 '20 at 13:14
  • @Xosrov what do you get if you use `len(thatString)`? 4? Or 24? If the string contained the encoded sequences, you'd get 6 characters per original character – Panagiotis Kanavos Feb 26 '20 at 13:15
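
The length check suggested in the comments can be sketched as follows. The variable names are illustrative, and the raw-string literal is an assumption standing in for text read from a scraped page, where the backslashes are real characters rather than escapes resolved by the Python parser.

```python
# Escape sequences in Python source are resolved by the parser,
# so this literal already holds the four characters "test".
parsed = '\u0074\u0065\u0073\u0074'

# Scraped text keeps its backslashes; a raw string literal
# reproduces that situation (assumption: this mirrors the regex output).
scraped = r'\u0074\u0065\u0073\u0074'

print(len(parsed))        # 4
print(len(scraped))       # 24, i.e. 6 characters per original character
print(parsed == 'test')   # True
print(scraped == 'test')  # False
```

If the length comes out as 24, the string holds literal backslash sequences and still needs decoding; if it is 4, there is nothing left to convert.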

1 Answer


If you don't care about security, the simplest way is eval:

eval('"' + yourstringhere + '"')

But the correct way to do this would be:

x = bytes(yourstringhere, "utf-8").decode("unicode_escape")

The variable yourstringhere should already hold the string you mentioned before this line runs.

First we convert the string to bytes using the bytes function. Then we decode using a special codec called unicode_escape, which parses the \u.... sequences into actual Unicode characters.
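
As a minimal sketch of these two steps applied to the data chunk from the question: the `chunk` variable, the regex pattern, and the `decode_escapes` helper are assumptions for illustration, not part of the original answer. One caveat worth noting is that `unicode_escape` reads non-escaped bytes as Latin-1, so encoding with UTF-8 first can garble non-ASCII characters already present in the string. The sketch therefore swaps in `latin-1` with `backslashreplace` for the encoding step, which round-trips such characters.

```python
import re

# Sample of the scraped JavaScript, with literal backslashes (raw string).
chunk = r"I[0][1][0][0] = '\u0074\u0065\u0073\u0074';"


def decode_escapes(text: str) -> str:
    # Hypothetical helper: turn literal backslash-u sequences into characters.
    # backslashreplace rewrites characters outside Latin-1 as escapes,
    # so unicode_escape restores them instead of garbling them.
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")


# Assumed pattern: capture whatever sits between the first pair of single quotes.
match = re.search(r"'([^']*)'", chunk)
if match:
    print(decode_escapes(match.group(1)))  # test
```

Note that `unicode_escape` also interprets other backslash sequences such as `\n`, which may or may not be wanted depending on the scraped data.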

Omer Tuchfeld