Python unicode string - position

Question

I am stuck getting a position inside a string. I read the content of a file with

with io.open(testfile, 'r', encoding='utf-8') as f:

and the file contains:

\u2705 Offizielle Kan\u00e4le \ud83c\udde9\ud83c\uddea  \ud83c\udde6\ud83c\uddf9 \ud83c\udde8\ud83c\udded\n@GET_THIS_STING

What do I have to do so that "\u2705" is counted as 1 character? Then position 36 would be the start of @GET_THIS_STING.

--== EDIT ==-- I can now show the problem more clearly:

import json
from io import open

line = '{"message":{"message_id":3052,"text":"\u2705 Offizielle Kan\u00e4le \ud83c\udde9\ud83c\uddea  \ud83c\udde6\ud83c\uddf9 \ud83c\udde8\ud83c\udded\\n@GET_THIS_STING\\n123456789","entities":[{"offset":36,"length":26,"type":"mention"}]}}'
myjson = json.loads(line)
text = myjson.get("message", {}).get("text", None)
print(str(text).encode('utf-8', 'replace').decode())
print("string length: " + str(len(text)))
print(text[36:36+15])

print("-------------")

with open("/home/pi/telegram/phpLogs/test.txt", 'r', encoding='utf-8', errors="surrogateescape") as f:
    for line in f:
        myjson = json.loads(line)

        text = myjson.get("message", {}).get("text", None)
        print(text)
        print("string length: " + str(len(text)))
        print(text[36:36+15])

RESULT:

✅ Offizielle Kanäle ????  ???? ????
@GET_THIS_STING
123456789
string length: 61
@GET_THIS_STING
-------------
✅ Offizielle Kanäle    
@GET_THIS_STING123456789
string length: 54
HIS_STING123456

So when I have the string inside my code (UTF-8) as a variable (string), everything works fine. But when I create a file with this content and read it

"{"message":{"message_id":3052,"text":"\u2705 Offizielle Kan\u00e4le \ud83c\udde9\ud83c\uddea  \ud83c\udde6\ud83c\uddf9 \ud83c\udde8\ud83c\udded\\n@GET_THIS_STING\\n123456789","entities":[{"offset":36,"length":26,"type":"mention"}]}}"

I always get a "wrong" result :( So reading the file is my problem, because the strings are not the same afterwards: even the length is different!

Python version 3.6 on a Raspberry Pi with Raspbian. And yes, the file contains the string with \u escapes. – user3352603 May 03 '20 at 08:49

scribe · Answer 1 · 2020-05-03T21:35:51.413

0

If your file text.txt literally contains,

\u2705 Offizielle Kan\u00e4le \ud83c\udde9\ud83c\uddea  \ud83c\udde6\ud83c\uddf9 \ud83c\udde8\ud83c\udded\n@GET_THIS_STING

Try:

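with open('text.txt', 'r', encoding='utf-8') as f:
    raw = f.read()  # not named `str`, which would shadow the built-in

normal_str = ''
i = 0
while i < len(raw):
    if raw[i:i + 2] == '\\u':
        # a literal \uXXXX escape takes 6 characters in the file; count it as one
        i += 6
        normal_str += 'x'
    elif raw[i:i + 2] == '\\n':
        # a literal \n escape takes 2 characters in the file; count it as one
        i += 2
        normal_str += 'x'
    else:
        normal_str += raw[i]
        i += 1
print(normal_str)
print(normal_str[36:36 + 15])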

And, this outputs:

x Offizielle Kanxle xxxx  xxxx xxxxx@GET_THIS_STING

@GET_THIS_STING

With a file text.txt that looks something like this,

✅ Offizielle Kanäle    
@GET_THIS_STING

We can do,

with open('text.txt', 'r', encoding='utf-8') as f:
    content = f.read()  # not named `str`, which would shadow the built-in
index = content.find('@')
print('char @ is at index: {}'.format(index))
print(content[index:])

It outputs,

char @ is at index: 30
@GET_THIS_STING

edited May 03 '20 at 21:35

answered May 03 '20 at 08:06

scribe


The problem is that I cannot search for a position. I already have the position, so I have to use mystring[36:36+15]. The string is a result I stored from the Telegram API, and there you get a list of entities with "offset" and "length" that give the position inside the text – user3352603 May 03 '20 at 08:45
If you have the string, why can you not search? What exactly are the contents of the file you are reading? – scribe May 03 '20 at 21:02
I added a 1:1 example to the main post. My problem is still that when I read the string from a file, I get a different result than when I process the string directly as a variable in the code. So I am pretty sure it is an encoding issue. – user3352603 May 06 '20 at 19:57
Your post is not very easy to read. If you want help, then get a sample string and perform the function on it by hand and show the desired output. If I understand your problem, then the last update to my answer should solve it. It does not matter how your string is encoded as long as you don't need the data that is not encoded correctly. You want the encoding to be consistent even if wrong so you can filter out the data you really care for. – scribe May 07 '20 at 03:45

score 0 · Answer 2 · answered May 03 '20 at 09:52

0

If this string represents ✅ Offizielle Kanäle as suggested by @scribe's answer, then I think you are running into the problem described here: Converting to Emoji

Therefore I suggest replacing

with io.open(testfile, 'r', encoding='utf-8') as f:
    text = f.read() # you didn't show it but probably that's what you have done

with

with open(testfile, 'r', encoding='ascii') as f:
    text = json.load(f)

or, if the file is "JSON lines" rather than single JSON:

with open(testfile, 'r', encoding='ascii') as f:
    for line in f:
        text = json.loads(line)

and then text will be a proper Unicode string, so text[36:] should get you what you asked for.
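A minimal sketch of what json.loads does with the escaped surrogate pairs, assuming the file line looks like the one in the question. The raw string below is only a stand-in for a line read from the file, and the "text" key at the top level is a simplification of the nested message. It shows that the decoded string is counted in code points, so each flag counts as 2 rather than 4:

```python
import json

# Stand-in for one line of the file (assumption): the \uXXXX sequences are
# literal backslash escapes, hence the raw string.
line = r'{"text":"\u2705 Offizielle Kan\u00e4le \ud83c\udde9\ud83c\uddea  \ud83c\udde6\ud83c\uddf9 \ud83c\udde8\ud83c\udded\n@GET_THIS_STING\n123456789"}'

text = json.loads(line)["text"]

# json.loads joins each \ud83c\uddxx pair into a single code point,
# so len() counts every flag as 2 characters.
code_points = len(text)

# The same text measured in UTF-16 code units (2 bytes per unit),
# where every flag takes 4 units.
utf16_units = len(text.encode("utf-16-le")) // 2

at_index = text.index("@")

print(code_points, utf16_units, at_index)
```

With this input the "@" lands at index 30 rather than 36, which is the 6-character shift (3 flags, 2 extra units each) reported in the comments below.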

answered May 03 '20 at 09:52

Błotosmętek


Very close to the solution I need. It seems that the \n is now not counted, but it should be counted too. In my example, your code gives me "GET_THIS_STING" without the "@", which is missing because of the line break. This problem shifts the result further the more \n I have: for every \n, the result is cut by one more character. – user3352603 May 03 '20 at 11:48
Edit: emojis seem to cause the problem. In total my result is missing 6 letters, so it is shifted by 6: "IS_STING". The \n is NOT the issue, it seems to be counted, so it is the 3 flags that are not counted right. – user3352603 May 03 '20 at 11:57
This, in turn, seems to be a problem with surrogate characters: https://stackoverflow.com/questions/38147259/how-to-work-with-surrogate-pairs-in-python so try adding this: `text = text.encode('utf-16', 'surrogatepass').decode('utf-16')` – Błotosmętek May 03 '20 at 12:06
With myjson = myjson.encode('utf-16', 'surrogatepass').decode('utf-16') I get: AttributeError: 'dict' object has no attribute 'encode' – user3352603 May 03 '20 at 12:21
\ud83c\udde9\ud83c\uddea should be counted as 4, but is counted as 2 because the flag is interpreted as "DE" – user3352603 May 03 '20 at 12:47
No no no, you need to do it on a string, not a dict. Check where in `myjson` your string is… probably as a value for some key. – Błotosmętek May 03 '20 at 12:47
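
The comments above point to the offsets being counted in UTF-16 code units, which is also how the Telegram Bot API documents entity offsets and lengths. A minimal sketch under that assumption: encode the text as UTF-16, slice the bytes at twice the offset, and decode again. The helper name slice_utf16 is hypothetical, and the length of 15 is the one used in the question's own slices:

```python
def slice_utf16(text, offset, length):
    """Return the substring addressed by offset/length in UTF-16 code units."""
    # 'surrogatepass' keeps lone surrogates (as in the question's in-code
    # string) from raising an error during the round trip.
    data = text.encode("utf-16-le", "surrogatepass")
    # Each UTF-16 code unit is 2 bytes, so byte positions are doubled.
    chunk = data[2 * offset: 2 * (offset + length)]
    return chunk.decode("utf-16-le", "surrogatepass")


# The message text with the flags as real code points, i.e. what
# json.loads returns when the escapes come from a file.
text = ("\u2705 Offizielle Kan\u00e4le \U0001F1E9\U0001F1EA  "
        "\U0001F1E6\U0001F1F9 \U0001F1E8\U0001F1ED\n@GET_THIS_STING\n123456789")

mention = slice_utf16(text, 36, 15)
print(mention)
```

Slicing bytes instead of characters avoids having to normalise the string first, so the same helper should work whether the flags arrive as joined code points or as lone surrogates.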
