Tokenize Sentences or Tweets with Emoji Skin Tone Modifiers

Question

I want to tokenize a tweet containing multiple emojis that are not space-separated. I tried both NLTK's TweetTokenizer and spaCy, but both split emoji skin tone modifiers from their base emoji instead of keeping each pair together as one token. This needs to run on a huge dataset, so performance might be an issue. Any suggestions?

You may need to use Firefox or Safari to see the exact color tone emoji because Chrome sometimes fails to render it!

# NLTK
from nltk.tokenize.casual import TweetTokenizer
sentence = "I'm the most famous emoji  but what about  and "
t = TweetTokenizer()
print(t.tokenize(sentence))

# Output
["I'm", 'the', 'most', 'famous', 'emoji', '', '', '', 'but', 'what', 'about', '', 'and', '', '', '', '', '', '']

And

# Spacy
import spacy
nlp = spacy.load("en_core_web_sm")
sentence = nlp("I'm the most famous emoji  but what about  and ")
print([token.text for token in sentence])

# Output
['I', "'m", 'the', 'most', 'famous', 'emoji', '', '', '', 'but', 'what', 'about', '', 'and', '', '', '', '', '', '']

Expected Output

["I'm", 'the', 'most', 'famous', 'emoji', '', '', '', 'but', 'what', 'about', '', 'and', '', '', '', '']

What does color shading have to do with the emoji Unicode? – Sergey Bushmanov Sep 29 '20 at 20:45

score 4 · Answer 1 · answered Sep 29 '20 at 04:02

You should try using spacymoji. It's an extension and pipeline component for spaCy that can optionally merge combining emoji, such as skin tone modifiers, into a single token.

Based on the README you can do something like this:

import spacy
from spacymoji import Emoji

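# This is the spaCy 2.x style of adding a component object to the pipeline.
# With spaCy 3.x and a matching spacymoji release, the component is instead
# added by name: nlp.add_pipe("emoji", first=True)
nlp = spacy.load('en')
emoji = Emoji(nlp, merge_spans=True)  # this is actually the default
nlp.add_pipe(emoji, first=True)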

doc = nlp(...)

That should do it.

answered Sep 29 '20 at 04:02

polm23

This is helpful. Thanks. Do you think a regex can do the same job? – Abu Shoeb Sep 29 '20 at 13:40
Maybe, but I wouldn't recommend it, it'd just be harder to work with. – polm23 Sep 29 '20 at 16:06
OK, thanks. The reason I asked is that spaCy tokenizes `I'm` as `I` and `'m`, which is not the token I want. – Abu Shoeb Sep 29 '20 at 17:02
Unless there's a whole lot of cases like that I would recommend just using some post-processing to fix the spaCy tokens. For cases like `I'm` it should be very simple. Alternately you could use spaCy to replace emoji with filler tokens like `EMOJI_HAND` and then use another tokenizer you prefer; it'd be slow but give you a lot of control. – polm23 Sep 30 '20 at 02:31
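
For reference, the regex idea raised in the comments could look roughly like the sketch below. It is not part of the answer above (which advises against regex) and makes several assumptions: the names `TOKEN_RE` and `tokenize` are made up for illustration, the sample string is written with escapes, and only the five skin tone modifiers U+1F3FB to U+1F3FF are handled. Variation selectors and zero-width-joiner sequences are not.

```python
import re

# Hypothetical pattern, for illustration only.
TOKEN_RE = re.compile(
    r"\w+(?:'\w+)*"                      # words, keeping contractions such as I'm whole
    r"|[^\w\s][\U0001F3FB-\U0001F3FF]?"  # any other symbol, plus an optional skin tone modifier
)

def tokenize(text):
    """Return word tokens and symbol tokens; a modifier stays attached to its emoji."""
    return TOKEN_RE.findall(text)

# Thumbs up + medium skin tone, immediately followed by a waving hand (no spaces).
sample = "I'm here\U0001F44D\U0001F3FD\U0001F44B"
print(tokenize(sample))
```

A single compiled pattern like this is cheap to run over a large dataset, but it treats every non-word symbol as its own token, so punctuation handling is much cruder than in TweetTokenizer or spaCy.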

score 1 · Answer 2 · answered Sep 29 '20 at 13:52

Skin tone modifiers are just a set of five Unicode code points (U+1F3FB to U+1F3FF) that are placed directly after the emoji's base code point. They are listed here: http://www.unicode.org/reports/tr51/#Diversity
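
To make this concrete, here is a small sketch using only Python's standard library. The variable names are illustrative, and the thumbs-up emoji (U+1F44D) is just an arbitrary example base character.

```python
import unicodedata

# The five skin tone (Fitzpatrick) modifiers occupy U+1F3FB..U+1F3FF.
SKIN_TONE_MODIFIERS = [chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)]

# A toned emoji is two code points: the base emoji followed by one modifier.
thumbs_up_medium = "\U0001F44D\U0001F3FD"

# Print each code point with its official Unicode name.
for ch in thumbs_up_medium:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Because the string has length 2, any tokenizer that works character by character on symbols will see the modifier as a separate token unless it is told to merge it.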

You can use the merge method of spaCy's retokenizer (`Doc.retokenize`) after finding the bounds of a span that consists of an emoji plus its skin tone modifier.

See this answer of mine for how to merge tokens based on a regex pattern: https://stackoverflow.com/a/43390171/533399
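
The same merge idea can also be sketched without spaCy, as a post-processing pass over any token list (for example TweetTokenizer output, which keeps `I'm` intact). This is a swapped-in, tokenizer-agnostic variant rather than the retokenizer approach itself, and `merge_skin_tones` is a hypothetical helper name. It assumes the tokenizer emits each modifier as its own token directly after the base emoji.

```python
# The five skin tone modifiers, U+1F3FB..U+1F3FF.
SKIN_TONE_MODIFIERS = {chr(cp) for cp in range(0x1F3FB, 0x1F3FF + 1)}

def merge_skin_tones(tokens):
    """Glue each stand-alone skin tone modifier onto the token before it."""
    merged = []
    for tok in tokens:
        if tok in SKIN_TONE_MODIFIERS and merged:
            merged[-1] += tok   # attach the modifier to the previous token
        else:
            merged.append(tok)  # ordinary token, or a modifier with nothing before it
    return merged

# Example token list: thumbs up, its modifier as a separate token, then a waving hand.
tokens = ["emoji", "\U0001F44D", "\U0001F3FD", "and", "\U0001F44B"]
print(merge_skin_tones(tokens))
```

The pass is linear in the number of tokens, so it should add little overhead on a large dataset. Note that it does not check whether the preceding token is actually an emoji.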
