I get two different results when parsing the same JSON string with json.loads:
- once when the string is a literal in my .py source file, which is UTF-8 encoded
- once when I read the same string from a file with UTF-8 encoding
However, the length of the "text" value differs between the two cases, so I cannot extract the substring described in the entities.
What do I have to do so that the text loaded from the file has the same length, and I can extract the substring at offset 36 with length 14?
import json
from io import open
line = '{"message":{"message_id":3052,"text":"\u2705 Offizielle Kan\u00e4le \ud83c\udde9\ud83c\uddea \ud83c\udde6\ud83c\uddf9 \ud83c\udde8\ud83c\udded\\n@GET_THIS_USER\\n123456789","entities":[{"offset":36,"length":14,"type":"mention"}]}}'
myjson = json.loads(line)
text = myjson.get("message", {}).get("text", None)
print(str(text).encode('utf-8', 'replace').decode())
print("string length: " + str(len(text)))
print("Entity String = " + text[36:36+14])
print("-------------")
with open("/home/pi/telegram/phpLogs/test3.txt", 'r', encoding='utf-8', errors="surrogateescape") as f:
    for line in f:
        myjson = json.loads(line)
        text = myjson.get("message", {}).get("text", None)
        print(text)
        print("string length: " + str(len(text)))
        print("Entity String = " + text[36:36+14])
The line I put in the file:
{"message":{"message_id":3052,"text":"\u2705 Offizielle Kan\u00e4le \ud83c\udde9\ud83c\uddea \ud83c\udde6\ud83c\uddf9 \ud83c\udde8\ud83c\udded\n@GET_THIS_USER\n123456789","entities":[{"offset":36,"length":14,"type":"mention"}]}}
The result I get when running this with Python 3.6:
✅ Offizielle Kanäle ???? ???? ????
@GET_THIS_USER
123456789
string length: 60
Entity String = @GET_THIS_USER
-------------
✅ Offizielle Kanäle
@GET_THIS_USER
123456789
string length: 54
Entity String = HIS_USER
12345
So the text loaded from the file is 6 characters shorter, and the position is shifted so that "@GET_T" is cut off :(
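A possible explanation, offered as a sketch rather than a confirmed diagnosis: in the .py literal, Python itself interprets the \ud83c style escapes before json.loads sees them, so each surrogate half stays a separate character, whereas in the file json.loads decodes each surrogate pair into a single code point. That would account for the 6 missing characters (three flags, two pairs each). If the entity offsets count UTF-16 code units, which is how the Telegram Bot API documents them for message entities, slicing on the UTF-16 encoding should work in both cases. The helper name utf16_slice and the sample message below are made up for illustration and are not the data from the question.

```python
import json


def utf16_slice(text, offset, length):
    """Return the substring addressed by an offset/length in UTF-16 code units.

    Hypothetical helper; assumes the offsets count UTF-16 code units.
    """
    # 'surrogatepass' lets lone surrogate halves (the literal case) be encoded too.
    raw = text.encode("utf-16-le", "surrogatepass")
    # Every UTF-16 code unit occupies exactly two bytes.
    chunk = raw[2 * offset:2 * (offset + length)]
    return chunk.decode("utf-16-le", "surrogatepass")


# Made-up sample: one flag emoji (two surrogate pairs) followed by a mention.
# Normal literal: Python turns the escapes into four separate surrogate characters.
from_literal = json.loads('{"text": "Hi \ud83c\udde9\ud83c\uddea @user end"}')["text"]
# Raw literal: json.loads sees the escapes, as it does when reading the file.
from_file = json.loads(r'{"text": "Hi \ud83c\udde9\ud83c\uddea @user end"}')["text"]

print(len(from_literal), len(from_file))   # 17 15
print(utf16_slice(from_file, 8, 5))        # @user
print(utf16_slice(from_literal, 8, 5))     # @user
```

Prefixing the literal in the .py file with r (a raw string) should make the in-code case behave like the file case, since json.loads then receives the same escape sequences in both.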