4

I have the following text in a JSON file:

"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"

which represents the text "אחוזת פולג" in Hebrew.

No matter which encoding/decoding I use, I don't seem to get it right with Python 3.

If, for example, I try:

text = "\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092".encode('unicode-escape')

print(text)

I get that text is:

b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'

which is almost the correct byte string. If I were able to remove just one backslash and turn

b'\\xd7\\x90\\xd7\\x97\\xd7\\x95\\xd7\\x96\\xd7\\xaa \\xd7\\xa4\\xd7\\x95\\xd7\\x9c\\xd7\\x92'

into

text = b'\xd7\x90\xd7\x97\xd7\x95\xd7\x96\xd7\xaa \xd7\xa4\xd7\x95\xd7\x9c\xd7\x92'

(note how I changed the double backslashes to single backslashes), then

text.decode('utf-8')

would yield the correct text in Hebrew.
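(A minimal sketch of that manual fix, using the byte string shown above:)

```python
# The byte string below is the UTF-8 encoding of the target Hebrew text.
text = b'\xd7\x90\xd7\x97\xd7\x95\xd7\x96\xd7\xaa \xd7\xa4\xd7\x95\xd7\x9c\xd7\x92'
print(text.decode('utf-8'))  # אחוזת פולג
```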

But I am struggling to do so and couldn't manage to write a piece of code that will do that for me (rather than manually, as I just showed...).

Any help much appreciated...

Limitless
  • Can you send it back? Ask for "אחוזת פולג" or "\u05D0\u05D7\u05D5\u05D6\u05EA\u0020\u05E4\u05D5\u05DC\u05D2" in the JSON document. – Tom Blodget Sep 22 '18 at 16:39
  • Take a look at this: [Facebook JSON badly encoded](https://stackoverflow.com/questions/50008296/facebook-json-badly-encoded). – Andrey Tyukin Sep 22 '18 at 23:48

1 Answer

5

This string does not "represent" Hebrew text (at least not as Unicode code points, UTF-16, UTF-8, or in any well-known way at all). Instead, it represents a sequence of UTF-16 code units, and this sequence consists mostly of multiplication signs, currency signs, and some weird control characters.

It looks like the original character data has been encoded and decoded several times with some strange combination of encodings.

Assuming that this is what literally is saved in your JSON file:

"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"

you can recover the Hebrew text as follows:

# jsonInput is the raw file content, with the \uXXXX escapes still literal
# (hence the r-prefix):
jsonInput = r"\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"

(jsonInput
  .encode('latin-1')
  .decode('raw_unicode_escape')
  .encode('latin-1')
  .decode('utf-8')
)

For the above example, it gives:

'אחוזת פולג'

If you are using a JSON deserializer to read in the data, then you should of course omit the .encode('latin-1').decode('raw_unicode_escape') steps, because the JSON deserializer already interprets the escape sequences for you. That is, after the text element is loaded by the JSON deserializer, it is sufficient to encode it as latin-1 and then decode it as utf-8. This works because latin-1 (ISO-8859-1) is an 8-bit character encoding that corresponds exactly to the first 256 code points of Unicode, whereas your strangely broken text encodes each byte of the UTF-8 encoding as an ASCII escape of a UTF-16 code unit.
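A minimal sketch of that shorter path (the "name" key and the surrounding JSON document are hypothetical; the escaped value is the one from the question):

```python
import json

# Hypothetical JSON document; the r-prefix keeps the \uXXXX escapes literal,
# exactly as they would appear in the file on disk.
raw = r'{"name": "\u00d7\u0090\u00d7\u0097\u00d7\u0095\u00d7\u0096\u00d7\u00aa \u00d7\u00a4\u00d7\u0095\u00d7\u009c\u00d7\u0092"}'

data = json.loads(raw)  # json.loads already resolves the \uXXXX escapes
fixed = data["name"].encode('latin-1').decode('utf-8')
print(fixed)  # אחוזת פולג
```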

I'm not sure what you can do if your JSON contains both the broken escape sequences and valid text at the same time; the latin-1 step might then no longer work properly. Please don't apply this transformation to your JSON file unless the JSON itself contains only ASCII; it would only make everything worse.

Andrey Tyukin
  • Thanks for the detailed explanation. My JSON file contains both plain English text and \u00xx-type text ('representing' Hebrew). I have no way to distinguish between the English and Hebrew parts of the text in advance... Any idea how I can handle this? – Limitless Sep 22 '18 at 17:04
  • @Limitless I think that if your plain English text is strictly in the ASCII 0-127 range, it could actually still work, because it would simply pass through all encoding-decoding stages unchanged. Can you come up with an example where it doesn't work? – Andrey Tyukin Sep 22 '18 at 17:07
  • It's all data from Facebook posts/pages/comments, so I guess there are not many special characters... In case I encounter such a character, I'll update... Thanks!!! – Limitless Sep 22 '18 at 18:01
  • @Limitless I don't see any reason to assume that data from facebook posts doesn't have *every* single kind of weird characters in it. In the (rather probable) case that you find any characters outside of usual ascii-range there, which aren't encoded in the same "weird" format as the text in your original question, I'd suggest to investigate why the data arrives in such a broken format in the first place, instead of trying to reconstruct the original meaning from the already broken text. – Andrey Tyukin Sep 22 '18 at 18:40