
I get two different results when calling json.loads on what should be the same string:

  • first I load it from a string literal in my .py source file, which is UTF-8 encoded
  • then I load the same string from a file saved with UTF-8 encoding

The length of the "text" part differs between the two, so I cannot extract the substring that the entities describe.

What do I have to do so that the text read from the file has the same length, and I can get the substring at offset 36 with length 14?

import json
from io import open

line = '{"message":{"message_id":3052,"text":"\u2705 Offizielle Kan\u00e4le \ud83c\udde9\ud83c\uddea  \ud83c\udde6\ud83c\uddf9 \ud83c\udde8\ud83c\udded\\n@GET_THIS_USER\\n123456789","entities":[{"offset":36,"length":14,"type":"mention"}]}}'
myjson = json.loads(line)
text = myjson.get("message", {}).get("text", None)
print(str(text).encode('utf-8', 'replace').decode())
print("string length: " + str(len(text)))
print("Entity String = " + text[36:36+14])

print("-------------")

with open("/home/pi/telegram/phpLogs/test3.txt", 'r', encoding='utf-8', errors="surrogateescape") as f:
    for line in f:
        myjson = json.loads(line)

        text = myjson.get("message", {}).get("text", None)
        print(text)
        print("string length: " + str(len(text)))
        print("Entity String = " + text[36:36+14])

The line I put in the file:

{"message":{"message_id":3052,"text":"\u2705 Offizielle Kan\u00e4le \ud83c\udde9\ud83c\uddea  \ud83c\udde6\ud83c\uddf9 \ud83c\udde8\ud83c\udded\n@GET_THIS_USER\n123456789","entities":[{"offset":36,"length":14,"type":"mention"}]}}

Result when running with Python 3.6:

✅ Offizielle Kanäle ????  ???? ????
@GET_THIS_USER
123456789
string length: 60
Entity String = @GET_THIS_USER
-------------
✅ Offizielle Kanäle 🇩🇪  🇦🇹 🇨🇭
@GET_THIS_USER
123456789
string length: 54
Entity String = HIS_USER
12345

So the text read from the file is 6 characters shorter, and the position is shifted so that "@GET_T" is cut off :(

  • Does this answer your question? [Python unicode string - position](https://stackoverflow.com/questions/61571276/python-unicode-string-position) – Błotosmętek May 18 '20 at 08:57

1 Answer

  1. \u.... escape sequences in Python string literals are interpreted:

    >>> print('\u2705 Offizielle Kan\u00e4le')
    ✅ Offizielle Kanäle
    

    This means the string you pass to json.loads does not contain "backslash u ....", but the already interpreted characters. To get the same input as when reading from a text file, where those sequences aren't implicitly interpreted, you'd need to double all backslashes or prefix your string literal with r:

    line = r'{"message" ...}'
    

    Because you want the JSON parser to interpret those escape sequences, not Python.

  2. Those flag emoji are above the BMP in the Unicode table. Python can represent them directly:

    >>> '\U0001f1e9\U0001f1ea'  # U+1F1E9 U+1F1EA
    '🇩🇪'
    

    ECMAScript operates on UTF-16 and cannot directly represent code points above U+FFFF, so it needs to use surrogate pairs; hence U+1F1E9 U+1F1EA encodes to \ud83c\udde9\ud83c\uddea in UTF-16/ECMAScript/JSON. Those escapes are invalid/nonsensical in Python string literals though, because Python does not interpret them as UTF-16 surrogate pairs: each one becomes a separate, lone surrogate character.

  3. This is pretty superfluous:

    print(str(text).encode('utf-8', 'replace').decode())
    

    Just print(text). This will of course give this error message:

    UnicodeEncodeError: 'utf-8' codec can't encode characters in position 58-61: surrogates not allowed
    

    Which is a hint for the symptom described in point 2 above, which is a consequence of point 1. Your attempted workaround just hides that issue by discarding those surrogates and replacing them with ?.
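To make points 1 and 2 concrete, here is a small sketch. It uses a shortened variant of the question's string rather than the original line, and the names raw, cooked, from_raw and from_cooked are made up for illustration. It parses the same JSON once from a raw literal (equivalent to what the file contains) and once from a normal literal:

```python
import json

# Raw literal: the backslashes reach the JSON parser untouched,
# just like the text read from the file.
raw = r'{"text":"\u2705 Kan\u00e4le \ud83c\udde9\ud83c\uddea\n@GET_THIS_USER"}'

# Normal literal: Python itself turns every \uXXXX into one character,
# so each surrogate half becomes its own (lone) code point.
cooked = '{"text":"\u2705 Kan\u00e4le \ud83c\udde9\ud83c\uddea\\n@GET_THIS_USER"}'

from_raw = json.loads(raw)["text"]
from_cooked = json.loads(cooked)["text"]

# The JSON parser joins each surrogate pair into one code point,
# so the raw version is 2 characters shorter (one per flag letter).
print(len(from_raw), from_raw.index("@"))        # 26 12
print(len(from_cooked), from_cooked.index("@"))  # 28 14

# Lone surrogates cannot be encoded, which is where the error comes from.
try:
    from_cooked.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)  # False
```

The version built from the normal literal only appears to work because its length happens to match the UTF-16 count; the string itself is not valid text.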

deceze
  • But version 1, with the string inside the code as I posted it, works. The issue comes when loading the content from the file (version 2)! For me it is not important how the text is printed. The goal is that I also get "Entity String = @GET_THIS_USER" as the result when reading from the file with offset 36 and length 14. – user3352603 May 18 '20 at 10:38
  • Your "r" version makes the results the same, that's true, but it breaks the working version from the string. If you can give me a way to read from the file the way it works with the string, it would be solved ;) – user3352603 May 18 '20 at 10:43
  • No, the version reading from the file is the correct version! Your string literal version is the broken one. – deceze May 18 '20 at 11:18
  • But what should I do if my file looks the way it does and the entities tell me the position inside the text where the mention is? Even if it is wrong, I am searching for a way to find the correct position inside the text. If it is necessary to rewrite the file, that would be fine too, but the entity position is fixed, so I need a way of reading the file that behaves like reading the string in my example. – user3352603 May 18 '20 at 12:26
  • So, you're given that JSON and a string position calculated based on the UTF-16 character count and you can't change that? And that's immovable? Oh boy… You'd probably have to convert the string to UTF-16 and do your offset calculations based on that, but I won't be figuring this out for you right now. I'd advise you to open a new question which asks very specifically about exactly _that_. – deceze May 18 '20 at 12:44
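For anyone landing here later, a minimal sketch of the approach suggested in the last comment, assuming the entity offset and length really are counted in UTF-16 code units (nothing in this thread confirms that beyond the comment above). The helper name slice_utf16 is hypothetical, and the JSON is the question's line with the message_id field left out for brevity:

```python
import json


def slice_utf16(text, offset, length):
    """Return the part of text addressed by a UTF-16 code-unit offset and length."""
    data = text.encode("utf-16-le")  # exactly 2 bytes per UTF-16 code unit, no BOM
    return data[2 * offset:2 * (offset + length)].decode("utf-16-le")


# Raw literal, so the content matches what would be read from the file.
line = (
    r'{"message":{"text":"\u2705 Offizielle Kan\u00e4le '
    r'\ud83c\udde9\ud83c\uddea  \ud83c\udde6\ud83c\uddf9 \ud83c\udde8\ud83c\udded'
    r'\n@GET_THIS_USER\n123456789",'
    r'"entities":[{"offset":36,"length":14,"type":"mention"}]}}'
)

message = json.loads(line)["message"]
entity = message["entities"][0]
mention = slice_utf16(message["text"], entity["offset"], entity["length"])

print(len(message["text"]))  # 54 code points
print(mention)               # @GET_THIS_USER
```

This keeps the correctly parsed string from the file and only switches to UTF-16 for the slicing step. It will raise an error if the requested range cuts a surrogate pair in half, which should not happen for well-formed offsets.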