Removing emojis and special characters in Python

Question

I have a dataset called df_bios that looks like this:

{'userid': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7}, 'text_string': {0: 'I live in Miami and work in software', 1: 'Chicago, IL', 2: 'Dog Mom in Cincinnati ', 3: 'Accountant at @EY/Baltimore', 4: 'World traveler but I call Atlanta home', 5: '⚡️❤️‍ sc/-emmabrown1133 @shefit EMMA15', 6: 'Working in Orlando. From Korea.'}}

I'm trying to remove all the unnecessary emojis (as well as any other special characters, symbols, pictographs, etc.).

I tried using the answer provided here, but it didn't do anything:

import re
def remove_emojis(df_bios):
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # chinese char
        u"\U00002702-\U000027B0"
        u"\U00002702-\U000027B0"
        u"\U000024C2-\U0001F251"
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"
        u"\u2640-\u2642" 
        u"\u2600-\u2B55"
        u"\u200d"
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # dingbats
        u"\u3030"
                      "]+", re.UNICODE)
    return re.sub(emoj, '', df_bios)

It didn't return any errors, it just returned the same data without any changes.

I cannot reproduce the problem. If I try passing a string containing emojis to `remove_emojis`, the emojis are removed. If the question is actually "how do I apply a function that transforms a string, to the strings in a Pandas dataframe?", then **ask that** (and tag the question appropriately) - but I am pretty sure that is a duplicate anyway. You should well understand [ask] by now, but there's a reminder. Please also read https://meta.stackoverflow.com/questions/261592 and [mre]. — Karl Knechtel, Sep 21 '22 at 16:43
There is no answer to write @wizkids121 The regex you posted works when applied to a string of emojis, could you provide a copy of `df_bios`? Is it a dataframe, is it a dict? — PacketLoss, Sep 21 '22 at 16:44
... What? Nothing prevents you from [edit]ing the question without an answer. — Karl Knechtel, Sep 21 '22 at 16:44
@PacketLoss - I believe it is a dataframe. I was just trying to make it reproducible. I don't really know Python. But when I use R, there is something called the `dput` function and I was trying to use the Python equivalent of that. — wizkids121, Sep 21 '22 at 16:46
Karl, I do not understand what you're asking for. I posted what the structure of the dataframe looks like in my question. Can you please instruct as to what you're asking for? People are downvoting my question and I don't get why — wizkids121, Sep 21 '22 at 16:49

score 1 · Accepted Answer · answered Sep 21 '22 at 16:52

You can apply your remove_emojis function to the text_string column of your dataframe. apply calls the function once per string, and assigning the result back to the column replaces each emoji with nothing.

import re

import pandas as pd

def remove_emojis(text):
    # Takes a single string and returns it with the matched characters removed.
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
        u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, dingbats, arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # very broad: also covers CJK and Hangul text
        u"\U0001f926-\U0001f937"
        u"\U00010000-\U0010ffff"  # every code point outside the Basic Multilingual Plane
        u"\u2640-\u2642"
        u"\u2600-\u2B55"
        u"\u200d"  # zero width joiner
        u"\u23cf"
        u"\u23e9"
        u"\u231a"
        u"\ufe0f"  # variation selector-16
        u"\u3030"
                      "]+", re.UNICODE)
    return emoj.sub('', text)


data = {'userid': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7}, 'text_string': {0: 'I live in Miami and work in software', 1: 'Chicago, IL', 2: 'Dog Mom in Cincinnati ', 3: 'Accountant at @EY/Baltimore', 4: 'World traveler but I call Atlanta home', 5: '⚡️❤️‍sc/-emmabrown1133@shefit EMMA15', 6: 'Working in Orlando. From Korea.'}}

df_bios = pd.DataFrame(data)

# Apply the function to every string in the column and store the result back
df_bios['text_string'] = df_bios['text_string'].apply(remove_emojis)

print(df_bios)

Outputs

   userid                             text_string
0       1    I live in Miami and work in software
1       2                             Chicago, IL
2       3                  Dog Mom in Cincinnati 
3       4             Accountant at @EY/Baltimore
4       5  World traveler but I call Atlanta home
5       6         sc/-emmabrown1133@shefit EMMA15
6       7         Working in Orlando. From Korea.
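
One thing to watch for: the range \U000024C2-\U0001F251 in the pattern above spans most of the Basic Multilingual Plane above U+24C2, so it also strips Chinese, Japanese and Korean text, not just emojis. Below is a minimal sketch of a narrower pattern, applied with Series.str.replace instead of apply. The name EMOJI_PATTERN, the chosen ranges and the sample rows are assumptions for illustration; the ranges are not a complete emoji list and may need extending for your data.

```python
import re

import pandas as pd

# The broad range from the pattern above also matches Hangul:
BROAD = re.compile("[\U000024C2-\U0001F251]+")
print(repr(BROAD.sub("", "서울 Seoul")))  # the Korean characters are removed

# Illustrative, narrower ranges (an assumption, not a complete emoji list).
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F1E6-\U0001F1FF"  # regional indicator symbols (flag pairs)
    "\U0001F300-\U0001FAFF"  # pictographs, emoticons, transport, newer symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\u200d"                 # zero width joiner
    "\ufe0f"                 # variation selector-16
    "]+"
)

# Hypothetical sample rows; emojis are written as escapes so they are visible.
sample = pd.DataFrame(
    {
        "text_string": [
            "Chicago, IL",
            "\u26a1\ufe0f\u2764\ufe0f\u200d sc/-emmabrown1133",
            "서울 \u26a1\ufe0f",
        ]
    }
)

# Series.str.replace accepts a compiled pattern when regex=True.
sample["text_string"] = sample["text_string"].str.replace(
    EMOJI_PATTERN, "", regex=True
)
print(sample["text_string"].tolist())
```

Compiling the pattern once at module level also avoids rebuilding it on every call, which remove_emojis does for each row.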
