How to encode emojis that are in text with Python/pandas (for counting them/finding most frequently occurring, etc)?

Question

I am working in Python with pandas and I have a data frame in which one of the columns contains phrases that include emojis, such as "when life gives you 🍋s, make lemonade" or "Catch a falling ⭐️ and put it in your pocket". Not all of the phrases have emojis, and when they do, the emoji can be anywhere in the phrase (not just at the beginning or end). I want to go through each text and count the frequency of each emoji that appears, find the emojis that appear the most, etc. I am not sure how to actually process/recognize the emojis. If I go through each of the texts in the column, how would I go about identifying the emojis so I can gather the desired information such as counts, max, etc.?

Possible duplicate of [How to find and count emoticons in a string using python?](http://stackoverflow.com/questions/19149186/how-to-find-and-count-emoticons-in-a-string-using-python) — hashcode55, Feb 25 '17 at 09:40
The solutions posted there doesn't work for me. If you're familiar with this, would you be willing to help? — Jane Sully, Feb 25 '17 at 18:15
Yeah sure! I think the solutions are not working for you because the emoticons you have in your phrases are outside the range of unicode they have taken in the answers... Try re-adjusting the range and it should work. — hashcode55, Feb 25 '17 at 19:08
Okay! That makes complete sense. Do you know how to find suitable ranges. I still have some other emojis that aren't being recognized and am not sure how to appropriately increase the range? I appreciate your help! — Jane Sully, Feb 25 '17 at 21:05
Great! Last question, I promise. If you don't mind me asking, how did you determine the range? I would like to be able to come up with that myself, but I am not really sure how to. Thanks again :) — Jane Sully, Feb 25 '17 at 21:23
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/136646/discussion-between-hashcode55-and-jane-sully). — hashcode55, Feb 25 '17 at 21:24

score 3 · Accepted Answer · edited May 23 '17 at 11:46

Suppose you have a dataframe like this:

import re
import pandas as pd
from collections import defaultdict

df = pd.DataFrame({'phrases' : ["Smiley emoticon rocks! I like you.\U0001f601", 
                                "Catch a falling ⭐️ and put it in your pocket"]})

which yields

                 phrases
0   Smiley emoticon rocks! I like you.😁
1   Catch a falling ⭐️ and put it in your pocket

You can do something like:

# Dictionary storing emoji counts 
emoji_count = defaultdict(int)
for i in df['phrases']:
    for emoji in re.findall(u'[\U0001f300-\U0001f650]|[\u2000-\u3000]', i):
        emoji_count[emoji] += 1

print(emoji_count)

Note that I have changed the range in re.findall(u'[\U0001f300-\U0001f650]|[\u2000-\u3000]', i).

The second alternative, [\u2000-\u3000], handles a different Unicode range (it is the part that matches ⭐, U+2B50), but you should get the idea.
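The same counting can also be expressed with pandas string methods instead of an explicit loop. The following is a minimal sketch, not a definitive implementation: it assumes a pandas version that provides `Series.explode` (0.25 or later), reuses the character ranges from the regex above, and the `emojis` column and the third sample phrase are made up for illustration.

```python
import pandas as pd

# Same character ranges as in the loop above; adjust them for your own data.
EMOJI_PATTERN = '[\U0001f300-\U0001f650]|[\u2000-\u3000]'

df = pd.DataFrame({'phrases': ["Smiley emoticon rocks! I like you.\U0001f601",
                               "Catch a falling \u2b50\ufe0f and put it in your pocket",
                               "No emoji here"]})  # hypothetical extra row

# One list of matches per row (an empty list when a phrase has no emoji).
df['emojis'] = df['phrases'].str.findall(EMOJI_PATTERN)

# Flatten to one emoji per row, drop the rows that had no match, then count.
emoji_freq = df['emojis'].explode().dropna().value_counts()
print(emoji_freq)
```

`value_counts()` returns the counts sorted from most to least frequent, so `emoji_freq.head(n)` would give the top `n` emojis.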

In Python 2.x you can convert the emoji to unicode using

unicode('⭐️', 'utf-8')  # output: u'\u2b50\ufe0f'

Output of the counting loop for the two-row dataframe above (Python 3):

defaultdict(<class 'int'>, {'😁': 1, '⭐': 1})
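To get from raw counts to "the emoji that appears the most", one option is `collections.Counter`, which behaves like the `defaultdict(int)` above but adds a `most_common` method. This is only a sketch: the `phrases` list is hypothetical sample data standing in for `df['phrases']`, and the regex ranges are the same ones used above.

```python
import re
from collections import Counter

# Hypothetical sample phrases; in practice iterate over df['phrases'].
phrases = ["I like you \U0001f601\U0001f601",
           "Catch a falling \u2b50\ufe0f and put it in your pocket",
           "Still smiling \U0001f601"]

pattern = re.compile('[\U0001f300-\U0001f650]|[\u2000-\u3000]')

emoji_count = Counter()
for phrase in phrases:
    # update() adds one to the count of every match in this phrase.
    emoji_count.update(pattern.findall(phrase))

# most_common(n) returns the n most frequent (emoji, count) pairs, highest first.
top_emoji, top_count = emoji_count.most_common(1)[0]
print(top_emoji, top_count)
```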

That regex is shamelessly stolen from this link.
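If some emojis are still not matched, it can help to look at the code points they are made of and then widen the regex ranges to cover them. Below is a small Python 3 sketch of that idea (the counterpart of the Python 2 `unicode(...)` call above); `describe_code_points` is a hypothetical helper name, not part of any library.

```python
import unicodedata

def describe_code_points(text):
    """Return (hex code point, Unicode name) for every non-ASCII character."""
    return [('U+%04X' % ord(ch), unicodedata.name(ch, 'UNKNOWN'))
            for ch in text if ord(ch) > 127]

sample = "Catch a falling \u2b50\ufe0f \U0001f601"
for code_point, name in describe_code_points(sample):
    print(code_point, name)
```

Here the star turns out to be two code points, U+2B50 followed by the variation selector U+FE0F. The regex above only matches the first of them, which is why the counted key is the bare '⭐'.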
