
I have a list of tweets that was delivered as a CSV file. When I read it, the emojis show up as literal strings such as `<U+0001F602>` instead of the actual Unicode characters, so I can't translate them to their real names ("waffle" or "heart").

import os
import pandas as pd

def load_csv(csv_name):
    path = os.getcwd()
    df = pd.read_csv(path + "/" + csv_name, header=0, index_col=0, parse_dates=True, sep=",", encoding="utf-8")
    return df

csv_name = "tweets_nikekaepernick.csv"
df = load_csv(csv_name)

text = df["tweet_full_text"].iloc[0]
text

Out[]: 'Hi <U+0001F602><U+0001F602><U+0001F480><U+0001F480><U+0001F480><U+0001F480>'
  • If I create a .csv with UTF emoticons like the ones you mention, and use `read_csv` to read it into a `DataFrame`, Python correctly prints the emoticons when accessing values from it. If you can provide a link to a data file that has this problem, that might shed some light on why you're seeing this behaviour. Also, can you mention the specific version of Python you're using? – Grismar Oct 06 '21 at 22:08
  • Possibly the output device you're running the interpreter on doesn't support UTF-8 text output? How exactly are you running the above code? – Grismar Oct 06 '21 at 22:09
  • Hi, I'm running this in Jupyter Notebook with Python 3.8.8. [link](https://drive.google.com/file/d/1oldvYLOD1NpKSbrAKDwARe-LXpDiFIsi/view?usp=sharing) – eoterochio Oct 06 '21 at 23:54
  • As an aside, `pd.read_csv(os.getcwd() + "/" + "filename", ...)` is just a really inconvenient and clumsy way to say `pd.read_csv("filename", ...)`. Perhaps see also [What exactly is current working directory?](https://stackoverflow.com/questions/45591428/what-exactly-is-current-working-directory/66860904) – tripleee Oct 07 '21 at 06:02

1 Answer


Try the demoji package. Its project page has more details.

code

import re
import demoji
demoji.download_codes()  # needed on older demoji versions; releases from 1.0 bundle the data and deprecate this call

text = 'Hi <U+0001F602><U+0001F602><U+0001F480><U+0001F480><U+0001F480><U+0001F480>'

# turn '<U+0001F602>' into the escape '\U0001F602', then decode it to the real character
text_ = re.sub(r'\+|>', '', text).replace('<', '\\').encode().decode('unicode-escape')

# find the emojis and their names
demoji.findall(text_)

result

demoji.findall(text_)
Out[1]: {'💀': 'skull', '😂': 'face with tears of joy'}
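Note that the regex chain above removes every `+` and `>` in the text, not only the ones inside placeholders, and the `unicode-escape` round trip can garble non-ASCII characters that are already present in a tweet. A more targeted conversion could look like the sketch below. It assumes every placeholder has the form `<U+` followed by 4 to 8 hex digits and `>`; the helper name `decode_placeholders` is hypothetical and not part of any library.

```python
import re

# Matches placeholders such as <U+0001F602> (assumed format: 4 to 8 hex digits).
PLACEHOLDER = re.compile(r"<U\+([0-9A-Fa-f]{4,8})>")

def decode_placeholders(text):
    """Replace each <U+XXXX> placeholder with the character it encodes."""
    def _to_char(match):
        code_point = int(match.group(1), 16)
        # Leave the placeholder untouched if it is outside the Unicode range.
        if code_point > 0x10FFFF:
            return match.group(0)
        return chr(code_point)
    return PLACEHOLDER.sub(_to_char, text)

text = 'Hi <U+0001F602><U+0001F602><U+0001F480><U+0001F480><U+0001F480><U+0001F480>'
text_ = decode_placeholders(text)
print(text_)
```

To convert a whole column, something like `df["tweet_full_text"].map(decode_placeholders)` should work, assuming the column contains only strings (missing values would need to be handled first).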

More

If you want to remove the emojis instead, you can try the code below, which is adapted from another source:

pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+",
    flags=re.UNICODE,
)

print(pattern.sub(r'', text_))
>>> Hi
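If the goal is only to strip the emojis, the `<U+XXXX>` placeholders can also be removed directly from the raw text, without decoding them first. A minimal sketch, assuming the placeholders always consist of `<U+`, 4 to 8 hex digits, and `>`:

```python
import re

text = 'Hi <U+0001F602><U+0001F480>'

# Drop every <U+XXXX> placeholder, then trim the leftover whitespace.
cleaned = re.sub(r"<U\+[0-9A-Fa-f]{4,8}>", "", text).strip()
print(cleaned)  # Hi
```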

Or, if you want to translate the emojis into their text names, you can try:

import emoji
print(emoji.demojize(text_))

>>> Hi :face_with_tears_of_joy::face_with_tears_of_joy::skull::skull::skull::skull:
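If installing a third-party package is not an option, the standard library's `unicodedata.name` returns the official Unicode character name, which is close to what the question asks for. The sketch below assumes the text already contains real emoji characters rather than `<U+XXXX>` placeholders. The helper `emoji_names` is hypothetical, and filtering on the "So" (Symbol, other) category is only a rough heuristic: it also catches non-emoji symbols and ignores multi-character emoji sequences.

```python
import unicodedata

def emoji_names(text):
    """Return lowercase Unicode names of the symbol characters in text."""
    names = []
    for ch in text:
        # Heuristic: most single-character emoji are in category "So".
        if unicodedata.category(ch) == "So":
            names.append(unicodedata.name(ch, "unknown").lower())
    return names

sample = "Hi \U0001F602\U0001F480"
print(emoji_names(sample))  # ['face with tears of joy', 'skull']
```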
  • Thanks for the answer. However, it still doesn't work :( Even if I paste the string as it appears, it won't find any emoji. Actually, I never get to see the emoji image, I only get these strings. I tried ` import demoji demoji.download_codes() text = 'Hi ' demoji.findall(text) ` but I get ' {} ' – eoterochio Oct 07 '21 at 01:23