How to get a substring of content with an emoji in it (in Python)

Question

Take the string:

" @train1 hello there"

I have the location of the @train1 in the string with an offset and length.

{
 offset: 3
 length: 7 
}

I try to get the substring from the original string with:

sub_str = msg[offset: offset + length]

However, the emoji seems to be counted as 2 characters, so I am getting:

"train1 "

instead of

"@train1"

Is there a way to get sub-strings with multi-byte characters?

If I understand the problem correctly, `msg[offset - 1: offset + length - 1]` will return what you expect — rzlvmp, Aug 03 '22 at 00:32
Not necessarily, because the code works fine if there are no emojis before the @. Also, if I had two emojis, like ` @train1 asdasdasd`, the offset would be 5. The offset is calculated in JS and then passed in a POST request. — Aaron Nebbs, Aug 03 '22 at 00:35

Talon · Answer 1 · 2022-08-03T00:55:25.510

In its string form, the emoji will be one character.

In : msg[0]
Out: ''
In : msg[1]
Out: ' '
In : msg[2]
Out: '@'
In : msg[3]
Out: 't'
In : msg[3:3+7]
Out: 'train1 '

You might have an off-by-one error in your slicing, then, as your token starts with the caret at index 2, between the space and the @. If your offset and length data are static, you might want to subtract 1 from the offset.

Some discussion after a comment:

It seems you get the indexes from another source and the message does not necessarily contain exactly one emoji. In that case this can be far from trivial: there are multi-character emojis when modifiers are active (e.g. a family emoji joined with zero-width joiners, which is 7 codepoints and 25 bytes in UTF-8), symbols that use non-ASCII characters, etc. And then it depends again on how your data source interprets those.

You could get a list of emojis (e.g. from the emoji module), look up whether characters in your message are emojis and, if so, duplicate them so your indexes fit. This will, however, cause trouble if that emoji is in the part you want to slice out.
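A variant of this idea that needs no emoji list, as a rough sketch: assuming the offsets count UTF-16 code units (as JavaScript string indices do), every character above U+FFFF takes two units, so the offset can be mapped to a Python index instead of padding the string. The helper name and the stand-in emoji are illustrative and not taken from the question.

```python
def utf16_to_codepoint_index(text: str, utf16_offset: int) -> int:
    """Map an offset counted in UTF-16 code units to a Python string index."""
    units = 0
    for index, ch in enumerate(text):
        if units >= utf16_offset:
            return index
        # Characters outside the Basic Multilingual Plane need a surrogate pair.
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(text)


# Stand-in emoji (U+1F4D9); the emoji in the question may differ.
msg = "\U0001F4D9 @train1 hello there"
start = utf16_to_codepoint_index(msg, 3)
end = utf16_to_codepoint_index(msg, 3 + 7)
print(msg[start:end])  # @train1
```

Because the original string is sliced, an emoji inside the requested range comes back intact rather than duplicated.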

On the other hand, if it's the token @train1 you want, and in other messages you want other tokens like @token, you could discard the offset information and just look for words that start with @.
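A minimal sketch of that approach, assuming a token is an @ followed by letters, digits or underscores (the pattern and the function name are assumptions, not something the question specifies):

```python
import re

# Assumed token shape: "@" followed by one or more word characters.
MENTION_PATTERN = re.compile(r"@\w+")


def find_mentions(text: str) -> list[str]:
    """Return every @token found in the text, in order of appearance."""
    return MENTION_PATTERN.findall(text)


# Stand-in emoji (U+1F4D9); the emoji in the question may differ.
print(find_mentions("\U0001F4D9 @train1 hello there"))  # ['@train1']
```

Note that this pattern would also pick up the part after the @ in an email address, so it may need tightening for real messages.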

score 0 · Answer 2 · answered Aug 03 '22 at 19:16

If your data is coming from a program that counts Unicode graphemes, you could use the third-party regex library to split the string into graphemes, which are matched by \X, and then use your offsets on the resulting list of graphemes:

import regex

msg = " @train1 hello there"
graphemes = regex.findall(r'\X', msg)
print(graphemes)
# ['', ' ', '@', 't', 'r', 'a', 'i', 'n', '1', ' ', 'h', 'e', 'l', 'l', 'o', ' ', 't', 'h', 'e', 'r', 'e']

msg = "‍‍‍ @train1 hello there"
graphemes = regex.findall(r'\X', msg)
print(graphemes)
# ['‍‍‍', ' ', '@', 't', 'r', 'a', 'i', 'n', '1', ' ', 'h', 'e', 'l', 'l', 'o', ' ', 't', 'h', 'e', 'r', 'e']

It's not the same, `list()` splits the string at each character. Try it out with the strings from my example code — cuzi, Aug 04 '22 at 07:46

score 0 · Answer 3 · answered Aug 04 '22 at 14:52

Okay, here is a slightly dirty way, but maybe it will help you find a better solution:

Let's suppose that we have

string = "📙 123"

Where
Javascript output is: string[3] → 1
Python output is: string[3] → 2

Why does this happen?

Python treats the emoji as one character (one code point), but JavaScript treats it as two (two UTF-16 code units).

Let's see how this string looks in JavaScript in escaped form:

import json

print(json.dumps(string).strip('"'))

The output will be:

# The raw string looks like '\\ud83d\\udcd9 123'. \\ (an escaped \) means that \u is not a Unicode escape but ordinary text starting with \u
\ud83d\udcd9 123

If you paste this line into the browser's console, you will get the emoji.

So if we replace every \uXXXX escape with X, for example, the string length will match what JavaScript counts. Let's do it with a regex:

import json
import re

# json.dumps escapes each non-ASCII UTF-16 code unit as \u plus exactly four hex digits
new_string = re.sub(r'\\u[0-9a-f]{4}', 'X', json.dumps(string).strip('"'))
print(new_string)

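The output will be XX 123, and new_string[3] will be 1, the same as in JavaScript.

But be careful: this solution replaces every non-ASCII character with X (one X per UTF-16 code unit), so only ASCII characters can be read back this way. Characters that JSON escapes in other ways (quotes, backslashes, newlines) would also throw the count off.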
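A sketch of an alternative that keeps the original characters instead of replacing them with X: encode the string as UTF-16, slice the bytes (two per code unit), and decode the result. This assumes the offset and length are counted in UTF-16 code units, as JavaScript string indices are; `slice_utf16` is a hypothetical helper name.

```python
def slice_utf16(text: str, offset: int, length: int) -> str:
    """Slice text using an offset and length counted in UTF-16 code units."""
    # "utf-16-le" uses two bytes per code unit and adds no byte order mark.
    encoded = text.encode("utf-16-le")
    chunk = encoded[2 * offset : 2 * (offset + length)]
    return chunk.decode("utf-16-le")


string = "\U0001F4D9 123"  # the emoji from the escaped form \ud83d\udcd9
print(slice_utf16(string, 3, 1))  # 1

msg = "\U0001F4D9 @train1 hello there"
print(slice_utf16(msg, 3, 7))  # @train1
```

If the offsets cut through the middle of a surrogate pair, the decode step raises a UnicodeDecodeError, which at least makes bad offsets visible.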


If you are able to edit the JavaScript side, I recommend using var chars = Array.from(string). That will generate the correct sequence of characters: [ "📙", " ", "1", "2", "3" ]
