
For a data science project I am tasked with cleaning up our Twitter data. The tweets contain Unicode-escaped emojis (and other characters) in the form of \ud83d\udcf8 (camera emoji) or \ud83c\uddeb\ud83c\uddf7 (French flag), for example.

I am using Python's "re" module, and so far I have been able to remove "simple" Unicode characters like \u201c (left double quotation mark) with something like

text = re.sub(u'\u201c', '', text)

However, when I try to remove more complex sequences, for example

text = re.sub(u'\ud83d\udcf8', '', text) # remove camera emoji
text = re.sub(u'\ud83c\uddeb\ud83c\uddf7', '', text) # remove French flag emoji

nothing happens, whether I prefix the string with 'u', with 'r', or with nothing at all. The escape sequence remains in the string.

EDIT: Thanks to @Shawn Shroyer's answer I found out that

text = re.sub(u'\\ud83d\\udcf8', '', text)

works fine! I just had to escape the backslashes. Now only my second problem remains (see below).
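
One possible reading of why the escaped pattern matched is that the tweets hold the escapes as literal text (a backslash, the letter u, and four hex digits) rather than as real characters. That is an assumption, since the question does not show how the data was loaded. Under that assumption, a minimal sketch is to decode the escapes first and filter afterwards; the helper name decode_unicode_escapes is hypothetical:

```python
import re

# Matches a literal backslash, 'u', and exactly four hex digits.
_ESCAPE = re.compile(r'\\u([0-9a-fA-F]{4})')

def decode_unicode_escapes(text):
    # Hypothetical helper: assumes the escapes are stored as literal text.
    # Step 1: replace each literal escape with the 16-bit code unit it names.
    units = _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
    # Step 2: round-trip through UTF-16 so that a high/low surrogate pair
    # merges into one code point; 'surrogatepass' lets surrogates through.
    return units.encode('utf-16', 'surrogatepass').decode('utf-16', 'surrogatepass')

raw = r'photo \ud83d\udcf8'
decoded = decode_unicode_escapes(raw)
```

After this step the camera emoji is the single code point U+1F4F8, which a range-based pattern can match.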

The second problem is that I don't want to specify every emoji individually. I would like to remove them all in a simpler way, but without removing ALL Unicode characters, because I need to retain characters like \u2019 (right single quotation mark).

EXQuIsIIt

1 Answer

My suggestion is to create a list of the values you want to remove. You need to escape each \ by adding another backslash, or add 'ur' before your string so that backslashes do not need to be escaped (the 'ur' prefix is Python 2 only; in Python 3 it is a syntax error, and a plain 'r' prefix is the raw-string form).

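import re
# Sequences to strip: camera emoji and French flag emoji
to_remove_arr = [u"\ud83d\udcf8", u"\ud83c\uddeb\ud83c\uddf7"]
# Join into one alternation; re.escape guards against regex metacharacters
pattern_str = "|".join(re.escape(s) for s in to_remove_arr)
text = re.sub(pattern_str, "", text)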

Edit: the above solution removes specific Unicode characters. To remove all non-ASCII characters (note that this also drops characters such as \u2019):

text = text.encode("ascii", "ignore").decode()
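
Since the question asks to retain characters like \u2019, a variant of the ASCII filter with an allowlist may be closer to what is needed. This is only a sketch: the function name strip_non_ascii_except and the contents of KEEP are assumptions for illustration, not part of the original answer.

```python
# Hypothetical allowlist of non-ASCII characters to preserve.
KEEP = frozenset({'\u2019'})  # right single quotation mark

def strip_non_ascii_except(text, keep=KEEP):
    # Keep a character if it is ASCII (code point below 128) or allowlisted.
    return ''.join(ch for ch in text if ord(ch) < 128 or ch in keep)

sample = 'it\u2019s \U0001F4F8 caf\u00e9'
cleaned = strip_non_ascii_except(sample)
```

Here the apostrophe survives, while the emoji and the accented letter are dropped, so the allowlist has to name every non-ASCII character worth keeping.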

Edit: to remove only emojis I found:

# Compiled once: symbols and dingbats, pictographs and emoticons, transport symbols
RE_EMOJI = re.compile(u'[\U00002600-\U000027BF\U0001f300-\U0001f64F\U0001f680-\U0001f6FF]')

def strip_emoji(text):
    return RE_EMOJI.sub('', text)
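
The three ranges above do not cover flags, which are pairs of regional indicator symbols (U+1F1E6 to U+1F1FF), so the French flag from the question would slip through. A sketch with a few more Unicode blocks follows; the selection of ranges is an assumption chosen for illustration and is not an exhaustive emoji list, and the name strip_emoji_extended is hypothetical. It assumes Python 3 and text that contains real characters rather than literal escapes.

```python
import re

# Assumed ranges for illustration; extend or trim them to fit the data.
EMOJI_PATTERN = re.compile(
    '['
    '\U0001F1E6-\U0001F1FF'  # regional indicators (flag pairs such as FR)
    '\U0001F300-\U0001F5FF'  # miscellaneous symbols and pictographs
    '\U0001F600-\U0001F64F'  # emoticons
    '\U0001F680-\U0001F6FF'  # transport and map symbols
    '\U0001F900-\U0001F9FF'  # supplemental symbols and pictographs
    '\U0001FA70-\U0001FAFF'  # symbols and pictographs extended-A
    '\u2600-\u27BF'          # miscellaneous symbols and dingbats
    '\uFE0F\u200D'           # variation selector-16 and zero-width joiner
    ']+'
)

def strip_emoji_extended(text):
    # Remove every run of characters that falls in the ranges above.
    return EMOJI_PATTERN.sub('', text)

sample = 'Paris \U0001F1EB\U0001F1F7 \U0001F4F8 it\u2019s'
cleaned = strip_emoji_extended(sample)
```

The quotation mark \u2019 lies outside all of these ranges and is kept. Removing the zero-width joiner everywhere is a trade-off: it tidies up leftovers of joined emoji sequences, but some scripts use that character legitimately, so it may be better left out for multilingual text.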
Mark Tolonen
Shawn Shroyer
  • Edited answer to include re.sub and to combine the array of Unicode characters into a single regular expression – Shawn Shroyer Jan 05 '21 at 16:55
  • thank you, this way I at least would have a working solution! But it would be tiresome to collect every single emoji individually, so hopefully someone has an idea to remove them all together ;) – EXQuIsIIt Jan 05 '21 at 17:00
  • Sorry - based on the question I thought you were looking to only remove specific unicode characters. Edited answer with a solution to remove all non-ascii characters – Shawn Shroyer Jan 05 '21 at 17:09
  • Saw your edit and included code to strip only emojis – Shawn Shroyer Jan 05 '21 at 17:15
  • That last one was the solution I was looking for, thanks! Some emojis are still slipping through, but I just have to adjust the ranges or add more. Thank you very much! :) – EXQuIsIIt Jan 05 '21 at 17:39