I want to tokenize a tweet that contains multiple emojis which are not space-separated. I tried both NLTK's TweetTokenizer and spaCy, but both fail to tokenize emoji with skin-tone modifiers. This needs to be applied to a huge dataset, so performance may be an issue. Any suggestions?
You may need to use Firefox or Safari to see the exact skin-tone emoji, because Chrome sometimes fails to render them!
# NLTK
from nltk.tokenize.casual import TweetTokenizer
sentence = "I'm the most famous emoji but what about and "
t = TweetTokenizer()
print(t.tokenize(sentence))
# Output
["I'm", 'the', 'most', 'famous', 'emoji', '', '', '', 'but', 'what', 'about', '', 'and', '', '', '', '', '', '']
And
# spaCy
import spacy
nlp = spacy.load("en_core_web_sm")
sentence = nlp("I'm the most famous emoji but what about and ")
print([token.text for token in sentence])
# Output
['I', "'m", 'the', 'most', 'famous', 'emoji', '', '', '', 'but', 'what', 'about', '', 'and', '', '', '', '', '', '']
Expected Output
["I'm", 'the', 'most', 'famous', 'emoji', '', '', '', 'but', 'what', 'about', '', 'and', '', '', '', '']