JSON.loads encoding issues - File vs. String

Question

I get two different results when calling json.loads on the same string.

First I load it from a string literal in my .py source file, which is UTF-8 encoded.
Then I load the same string from a file, which is also UTF-8 encoded.

The length of the "text" part differs between the two, so I cannot extract the part of the string that the entities refer to.

What do I have to do so that the version read from the file gives the same length, letting me get the substring at offset 36 with length 14?

import json
from io import open

line = '{"message":{"message_id":3052,"text":"\u2705 Offizielle Kan\u00e4le \ud83c\udde9\ud83c\uddea  \ud83c\udde6\ud83c\uddf9 \ud83c\udde8\ud83c\udded\\n@GET_THIS_USER\\n123456789","entities":[{"offset":36,"length":14,"type":"mention"}]}}'
myjson = json.loads(line)
text = myjson.get("message", {}).get("text", None)
print(str(text).encode('utf-8', 'replace').decode())
print("string length: " + str(len(text)))
print("Entity String = " + text[36:36+14])

print("-------------")

with open("/home/pi/telegram/phpLogs/test3.txt", 'r', encoding='utf-8', errors="surrogateescape") as f:
    for line in f:
        myjson = json.loads(line)

        text = myjson.get("message", {}).get("text", None)
        print(text)
        print("string length: " + str(len(text)))
        print("Entity String = " + text[36:36+14])

The line I put in the file:

{"message":{"message_id":3052,"text":"\u2705 Offizielle Kan\u00e4le \ud83c\udde9\ud83c\uddea  \ud83c\udde6\ud83c\uddf9 \ud83c\udde8\ud83c\udded\n@GET_THIS_USER\n123456789","entities":[{"offset":36,"length":14,"type":"mention"}]}}

The result I get when running with Python 3.6:

✅ Offizielle Kanäle ????  ???? ????
@GET_THIS_USER
123456789
string length: 60
Entity String = @GET_THIS_USER
-------------
✅ Offizielle Kanäle    
@GET_THIS_USER
123456789
string length: 54
Entity String = HIS_USER
12345

So when reading from the file, the text is 6 characters shorter and the position is shifted, so that "@GET_T" is cut off :(

Does this answer your question? [Python unicode string - position](https://stackoverflow.com/questions/61571276/python-unicode-string-position) — Błotosmętek, May 18 '20 at 08:57

score 0 · Answer 1 · answered May 18 '20 at 09:25

\u.... escape sequences in Python string literals are interpreted:
```
>>> print('\u2705 Offizielle Kan\u00e4le')
✅ Offizielle Kanäle
```
Meaning, your JSON string literal does not contain "backslash u ....", but already interpreted characters. To get the same effect as reading from a text file, where those character sequences aren't implicitly interpreted, you'd need to double all backslashes, or prefix your string literal with r:
```
line = r'{"message" ...}'
```
That is because you want the JSON parser, not Python, to interpret those escape sequences.
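To make the difference concrete, here is a minimal sketch. The shortened `{"text": ...}` payload is made up for illustration and is not the full message from the question. It parses the same JSON once from a normal literal and once from a raw literal, which behaves like the line read from the file:
```python
import json

# Normal literal: Python itself turns each \uXXXX into a character while
# compiling the source, so the JSON parser receives four lone surrogates.
interpreted = '{"text": "\ud83c\udde9\ud83c\uddea"}'

# Raw literal: the backslashes survive, exactly as they would in a text file,
# so the JSON parser sees the escapes and joins each surrogate pair.
raw = r'{"text": "\ud83c\udde9\ud83c\uddea"}'

print(len(json.loads(interpreted)["text"]))  # 4
print(len(json.loads(raw)["text"]))          # 2
```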
Those flag emoji are above the BMP in the Unicode table. Python can encode that directly:
```
>>> '\U0001f1e9\U0001f1ea'  # U+1F1E9 U+1F1EA
'🇩🇪'
```
ECMAScript operates on UTF-16 and cannot directly represent code points above U+FFFF, so it needs to use surrogate pairs; hence U+1F1E9 U+1F1EA encodes to \ud83c\udde9\ud83c\uddea in UTF-16/ECMAScript/JSON. Those code points are invalid/nonsensical in Python string literals though, because Python does not interpret them as UTF-16 surrogate pairs; each one stays a lone surrogate.
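As an illustration of where that surrogate pair comes from, the following sketch computes it by hand. The helper name `to_surrogates` is hypothetical and not part of any library; the arithmetic is the standard UTF-16 split for code points above U+FFFF:
```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogate pair."""
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits go into the low surrogate
    return high, low

high, low = to_surrogates(0x1F1E9)
print(f"\\u{high:04x}\\u{low:04x}")  # \ud83c\udde9

# Python's own UTF-16 codec produces the same two code units.
assert '\U0001f1e9'.encode('utf-16-be') == bytes([0xD8, 0x3C, 0xDD, 0xE9])
```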
This is pretty superfluous:
```
print(str(text).encode('utf-8', 'replace').decode())
```
Just print(text). This will of course give this error message:
```
UnicodeEncodeError: 'utf-8' codec can't encode characters in position 58-61: surrogates not allowed
```
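This error is a hint at the symptom described in the second point above (the lone surrogates), which is a consequence of the first point (Python interpreting the escape sequences in the literal). Your attempted workaround just hides the issue by discarding those surrogates and replacing them with ?.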
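If a string already contains such lone surrogates, one possible repair is sketched below. This is an assumption about what might be useful, not something the question asks for: round-tripping through UTF-16 with the `surrogatepass` error handler joins each high/low pair into the real character.
```python
broken = '\ud83c\udde9\ud83c\uddea'  # four lone surrogates, as in the non-raw literal

try:
    broken.encode('utf-8')
except UnicodeEncodeError as exc:
    print(exc.reason)  # surrogates not allowed

# surrogatepass lets the surrogates through to UTF-16, where they form valid pairs;
# decoding that UTF-16 data then yields the two real code points.
repaired = broken.encode('utf-16', 'surrogatepass').decode('utf-16')
print(len(broken), len(repaired))  # 4 2
```
Note that this changes the string length, so offsets counted in UTF-16 code units no longer line up with Python indices afterwards.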

deceze

But version 1, with the string inside the code as I posted it, works. The issue comes when loading the content from the file (version 2)! For me it is not important how the text line or string is printed. The goal is that I also get "Entity String = @GET_THIS_USER" as the result when reading from the file with offset 36 and length 14. – user3352603 May 18 '20 at 10:38
Your "r" version makes the results the same, that's true, but it breaks the working version from the string. If you can give me a way to read from the file the same way the string is read, it would be solved ;) – user3352603 May 18 '20 at 10:43
No, the version reading from the file is the correct version! Your string literal version is the broken one. – deceze May 18 '20 at 11:18
But what should I do if my file looks the way it does and the entities tell me the position inside the text where the mention is? Even if it is wrong, I am searching for a way to find the correct position inside the text. If it were necessary to rewrite the file, that would also be fine, but the entity position is fixed, so I need a way of reading the file that behaves like reading the string in my example. – user3352603 May 18 '20 at 12:26
So, you're given that JSON and a string position calculated based on the UTF-16 character count and you can't change that? And that's immovable? Oh boy… You'd probably have to convert the string to UTF-16 and do your offset calculations based on that, but I won't be figuring this out for you right now. I'd advise you to open a new question which asks very specifically about exactly _that_. – deceze May 18 '20 at 12:44
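A rough sketch of that UTF-16 approach: encode the parsed text as UTF-16, slice by code units (two bytes each), and decode the slice again. The function name `slice_utf16` is made up for this example, and it assumes the offset and length in `entities` are counted in UTF-16 code units, as the last comment supposes.
```python
import json

def slice_utf16(text, offset, length):
    """Return the substring addressed by an offset/length given in UTF-16 code units."""
    units = text.encode('utf-16-le')  # 2 bytes per code unit, no BOM
    start, end = 2 * offset, 2 * (offset + length)
    return units[start:end].decode('utf-16-le')

# Raw literal, so the content matches the line stored in the file.
line = r'{"message":{"message_id":3052,"text":"\u2705 Offizielle Kan\u00e4le \ud83c\udde9\ud83c\uddea  \ud83c\udde6\ud83c\uddf9 \ud83c\udde8\ud83c\udded\n@GET_THIS_USER\n123456789","entities":[{"offset":36,"length":14,"type":"mention"}]}}'

message = json.loads(line)["message"]
entity = message["entities"][0]
print(len(message["text"]))  # 54
print(slice_utf16(message["text"], entity["offset"], entity["length"]))  # @GET_THIS_USER
```
A slice that starts or ends in the middle of a surrogate pair should make the final decode raise UnicodeDecodeError, which at least surfaces malformed offsets instead of silently returning the wrong text.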
