EDIT: I've tried the solution from the linked duplicate question. The answer using the regex package and regex.findall(r'\X', tweet) was working. However, I've just updated the regex package, and now it fails. No idea why so far.
I'm trying to correctly extract all emojis from tweets. My problem is that some emojis require one codepoint, while other emojis (e.g., country flags) require two codepoints. For example:
tweet = RT @CreepyHorrorGal: • 👓🇬🇧☠😲🤢 #creepy #horror
This is what I see when I print the tweet in Python. However, when I do:
for c in tweet:
    print(c)
It shows me:
...
a
l
:
•
👓
🇬
🇧
☠
...
Here, the two codepoints of the flag get separated and interpreted separately. Also the following code
tweet.encode('utf-16', 'surrogatepass').decode('utf-16').encode("raw_unicode_escape").decode("latin_1")
gives me:
RT @CreepyHorrorGal: \u2022 \U0001f453\U0001f1ec\U0001f1e7\u2620\U0001f632\U0001f922 #creepy #horror
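To make the escaped output easier to read, the individual codepoints can be looked up by name. This is only a minimal sketch using the standard library's unicodedata module; the helper name describe is made up for illustration.

```python
import unicodedata

def describe(text):
    """Return (codepoint, Unicode name) pairs for each character in text.

    `describe` is a hypothetical helper name, not part of any library.
    """
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

# The two codepoints that together render as the flag in the tweet.
flag = "\U0001f1ec\U0001f1e7"
for code, name in describe(flag):
    print(code, name)
```

Both codepoints turn out to be "regional indicator" letters (G and B), which suggests the renderer pairs them up rather than treating the flag as a single character.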
In principle, I understand all of these outputs. But I wonder: how does the browser (Jupyter notebook) know that \U0001f1ec\U0001f1e7 is one emoji requiring two codepoints, particularly given that it is preceded and followed by other emojis with no whitespace in between?
And how can I reliably extract all emojis correctly? Right now I use a simple regex, but it only works for emojis that are a single codepoint, i.e., I "destroy" flags by splitting them into their two codepoints. How can I solve this?
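One possible direction, sketched here with the standard re module only, is to try matching a pair of regional indicator codepoints before falling back to single codepoints. The character ranges below are an assumption chosen to cover the example tweet, not a complete definition of "emoji", and the sketch does not handle ZWJ sequences, skin-tone modifiers, or variation selectors. The names EMOJI_PATTERN and extract_emojis are made up for illustration.

```python
import re

# Assumed, incomplete ranges: regional indicator pairs (flags) are tried first,
# then single codepoints from a few common emoji/symbol blocks.
EMOJI_PATTERN = re.compile(
    "[\U0001F1E6-\U0001F1FF]{2}"                # two regional indicators = one flag
    "|[\U0001F300-\U0001FAFF\u2600-\u27BF]"     # single-codepoint emojis and symbols
)

def extract_emojis(text):
    """Return emojis found in text, keeping flag pairs together (hypothetical helper)."""
    return EMOJI_PATTERN.findall(text)

tweet = ("RT @CreepyHorrorGal: \u2022 "
         "\U0001f453\U0001f1ec\U0001f1e7\u2620\U0001f632\U0001f922 #creepy #horror")
print(extract_emojis(tweet))
```

Because the flag alternative comes first in the pattern, the two regional indicators stay together as one match instead of being returned separately.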