My research interest is the effect of emojis in text for sentiment analysis. I would like to extract all the emojis from my dataset. So far I have done the following:
import re
from emoji import UNICODE_EMOJI
emoji_1 = re.compile('[\\u203C-\\u3299\\U0001F000-\\U0001F644]')
emoji_list = list(filter(emoji_1.match, df['Tweet text']))
emo_found = ' '.join(emoji_list)
def get_emoji_set(text):
    return {letter for letter in text if letter in UNICODE_EMOJI['en']}
c = get_emoji_set(emo_found)
print(c)
But it is not extracting all the emojis. So far I have gotten only the following emojis using the above code:
{'', '', '', '', '', '', '', ''}
However, these are only some of the emojis present in the dataset. The following emojis are also present in my dataset, but I am not getting them in the result:
, , , , ,,,, + more emojis
Why is my code not extracting all the emojis from my dataset? Are there emojis that fall outside the ranges I defined in emoji_1? Are there any more ranges that I should include in the regex?
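To help narrow this down, here is a minimal sketch that uses only re and a few made-up tweets. The tweets list and the variable names are hypothetical stand-ins for df['Tweet text'], not my real data. It compares match, which only tests the start of each string, with findall, which scans the whole string, so the filter step may be dropping tweets where the emoji is not the first character:

```python
import re

# Same character ranges as emoji_1 above.
emoji_pattern = re.compile('[\u203C-\u3299\U0001F000-\U0001F644]')

# Hypothetical sample tweets standing in for df['Tweet text'].
tweets = [
    "\U0001F600 starts with an emoji",
    "emoji in the middle \U0001F602 of the text",
    "no emoji here",
]

# Pattern.match only tests the beginning of each string, so this keeps
# only the tweets whose first character falls inside the ranges.
kept_by_match = [t for t in tweets if emoji_pattern.match(t)]

# Pattern.findall scans the whole string and returns every matching character.
found_by_findall = {ch for t in tweets for ch in emoji_pattern.findall(t)}

print(len(kept_by_match), "tweet(s) kept by match")
print(len(found_by_findall), "distinct character(s) found by findall")
```

If the counts differ on the real data as well, then the character ranges may not be the only thing limiting the result.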
I have tried the following answer, but it does not return anything. I get an empty column.