
EDIT: I tried the solution from the linked duplicate question. The answer using the regex package and regex.findall(r'\X', tweet) used to work. However, I have just updated the regex package, and now it fails. I have no idea why so far.

I'm trying to correctly extract all emojis from tweets. My problem is that some emojis require one codepoint, while other emojis (e.g., country flags) require two codepoints. For example:

tweet = "RT @CreepyHorrorGal: • 👓🇬🇧☠😲🤢 #creepy #horror"

This is what I see when I print the tweet in Python. However, when I do:

for c in tweet:
    print(c)

It shows me:

...
a
l
:

•

👓
🇬
🇧
☠
😲
🤢
...

Here, the two codepoints of the flag are split apart and printed separately. Also, the following code

tweet.encode('utf-16', 'surrogatepass').decode('utf-16').encode("raw_unicode_escape").decode("latin_1")

gives me:

RT @CreepyHorrorGal: \u2022 \U0001f453\U0001f1ec\U0001f1e7\u2620\U0001f632\U0001f922 #creepy #horror
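To make that escaped output easier to read, here is a minimal sketch using only the standard library's unicodedata module. It lists the Unicode name of each code point in the emoji run; the helper name describe is just an illustration, not part of any library.

```python
import unicodedata

# The emoji run from the tweet, written with escapes so it does not depend on fonts.
emoji_run = "\U0001f453\U0001f1ec\U0001f1e7\u2620\U0001f632\U0001f922"

def describe(text):
    """Return (code point, Unicode name) for every code point in text."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>")) for ch in text]

for code, name in describe(emoji_run):
    print(code, name)
```

The two middle entries come out as REGIONAL INDICATOR SYMBOL LETTER G and REGIONAL INDICATOR SYMBOL LETTER B, which is why the flag falls apart into a "G" and a "B" when the string is iterated one code point at a time.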

In principle, I understand all of these outputs. But I wonder how the browser (Jupyter notebook) knows that \U0001f1ec\U0001f1e7 is one emoji made of two codepoints, particularly given that it is preceded and followed by other emojis with no whitespace in between.
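As far as I can tell, the grouping is not a browser trick: the Unicode text segmentation rules (UAX #29) keep a pair of regional indicator symbols together as one grapheme cluster, and the font then draws a flag for that pair if it has a glyph for it. The neighbouring emojis do not interfere because they are not regional indicators. A rough sketch of that pairing logic follows; flag_codes and is_regional_indicator are hypothetical helper names, and this is a simplification, not a full implementation of the segmentation rules.

```python
REGIONAL_A = 0x1F1E6  # REGIONAL INDICATOR SYMBOL LETTER A
REGIONAL_Z = 0x1F1FF  # REGIONAL INDICATOR SYMBOL LETTER Z

def is_regional_indicator(ch):
    """True if ch is one of the 26 regional indicator symbols."""
    return REGIONAL_A <= ord(ch) <= REGIONAL_Z

def flag_codes(text):
    """Pair consecutive regional indicators left to right; return their letter codes."""
    codes = []
    i = 0
    while i < len(text):
        if (i + 1 < len(text)
                and is_regional_indicator(text[i])
                and is_regional_indicator(text[i + 1])):
            # Map each indicator back to its ASCII letter, e.g. U+1F1EC -> "G".
            pair = text[i:i + 2]
            codes.append("".join(chr(ord(c) - REGIONAL_A + ord("A")) for c in pair))
            i += 2
        else:
            i += 1
    return codes

print(flag_codes("\U0001f453\U0001f1ec\U0001f1e7\u2620\U0001f632\U0001f922"))  # ['GB']
```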

And how can I reliably extract all emojis correctly? Right now I use a simple regex, but it only works with single codepoints, i.e., I "destroy" flags. How can I solve this?
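For illustration, one possible direction with only the standard re module is to try the flag pair before any single-codepoint alternative. This is a sketch under loud assumptions: the character ranges below are a rough approximation of "emoji", not the official Unicode emoji definition, and it does not handle ZWJ sequences, skin-tone modifiers, or variation selectors. EMOJI_PATTERN and extract_emojis are made-up names.

```python
import re

# Assumption: these ranges only approximate "emoji". The flag alternative comes
# first so that two regional indicators are consumed together as one match.
EMOJI_PATTERN = re.compile(
    "[\U0001F1E6-\U0001F1FF]{2}"   # flag: a pair of regional indicator symbols
    "|[\U0001F300-\U0001FAFF]"     # most pictographic emoji
    "|[\u2600-\u27BF]"             # miscellaneous symbols and dingbats
)

def extract_emojis(text):
    """Return all matches of the approximate emoji pattern, flags kept whole."""
    return EMOJI_PATTERN.findall(text)

tweet = ("RT @CreepyHorrorGal: \u2022 "
         "\U0001f453\U0001f1ec\U0001f1e7\u2620\U0001f632\U0001f922 #creepy #horror")
print(extract_emojis(tweet))
```

With the tweet above this yields five items, and the flag stays a single two-codepoint string instead of being split. The bullet (\u2022) is skipped because it lies outside the assumed ranges.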

Christian
  • I don't even see a flag in both examples, only the single characters "G" and "B". – Matthias Nov 20 '19 at 07:40
  • @Matthias Firefox shows me the flag in the example. But I assume it's quite likely that it doesn't work everywhere. – Christian Nov 20 '19 at 10:02
  • JFTR: I'm running Firefox 70.0.1 (64-bit) on Windows 10 and tried Chrome 78.0.3904.97. No flag for me. :( Must have to do something with available character sets/fonts. – Matthias Nov 21 '19 at 09:47
