I am working on a file that contains a large amount of data, including emojis. I am using OpenRefine to clean the data, but I cannot find a quick way to remove common emojis such as the smiley face, which appears a lot in the data. I tried a regular expression, and it worked for some emojis, but others still remain. Below is the pattern I tried in search and replace:

"[\p{C}]|[\p{So}]|[\u20E3]"
Danyah

2 Answers

Constructing a regex that matches every Unicode emoji is non-trivial, but there is a GitHub repo with a script that builds one from the Unicode standard (the output of that script is included as well):

https://github.com/mathiasbynens/emoji-regex
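
To see why a category-based pattern such as the one in the question leaves some characters behind, it can help to print the Unicode general category of each character in a problem cell. The sketch below is plain Python 3 run outside OpenRefine; `category_report` and `uncovered` are hypothetical helper names, and Python's Unicode tables may differ slightly from the Java ones OpenRefine uses, so treat the output as a diagnostic hint only.

```python
import unicodedata

def category_report(text):
    """Return (code point, general category) pairs for each character."""
    return [(hex(ord(ch)), unicodedata.category(ch)) for ch in text]

def uncovered(text):
    """Characters that [\\p{C}]|[\\p{So}]|[\\u20E3] would presumably not match.

    Assumption: a character is covered if its category is 'So',
    starts with 'C', or the character is U+20E3.
    """
    return [
        ch for ch in text
        if not (
            unicodedata.category(ch) == "So"
            or unicodedata.category(ch).startswith("C")
            or ch == "\u20e3"
        )
    ]

# Thumbs up followed by a skin tone modifier: the modifier is category Sk.
print(category_report("\U0001F44D\U0001F3FD"))
# Heavy black heart followed by variation selector-16 (category Mn).
print(category_report("\u2764\ufe0f"))
```

Skin tone modifiers (`Sk`), variation selectors (`Mn`), and a few arrows and punctuation marks used as emoji (`Sm`, `Po`) fall outside `\p{So}` and `\p{C}`, which is one likely reason some emoji fragments remain after the replace.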

Tom Morris

Could you try this code, using Jython/Python instead of GREL?

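import re

def remove_emojis(data):
    # One character class covering the main emoji blocks plus joiners and selectors.
    emoj = re.compile("["
        u"\U0001F600-\U0001F64F"  # emoticons
        u"\U0001F300-\U0001F5FF"  # symbols & pictographs
        u"\U0001F680-\U0001F6FF"  # transport & map symbols
        u"\U0001F1E0-\U0001F1FF"  # flags (regional indicators)
        u"\U00002500-\U00002BEF"  # box drawing, shapes, misc symbols, dingbats, arrows
        u"\U00002702-\U000027B0"  # dingbats
        u"\U000024C2-\U0001F251"  # very wide range: also removes CJK, kana and Hangul text
        u"\U0001f926-\U0001f937"  # people and gestures
        u"\U00010000-\U0010ffff"  # every character outside the Basic Multilingual Plane
        u"\u2640-\u2642"          # female sign through male sign
        u"\u2600-\u2B55"          # misc symbols through heavy large circle
        u"\u200d"                 # zero width joiner
        u"\u23cf"                 # eject symbol
        u"\u23e9"                 # fast-forward symbol
        u"\u231a"                 # watch
        u"\ufe0f"                 # variation selector-16
        u"\u3030"                 # wavy dash
        "]+", re.UNICODE)
    return emoj.sub(u'', data)

# OpenRefine expects the Jython expression to end with a return statement.
return remove_emojis(value)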
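
Note that the range `\U000024C2-\U0001F251` in the pattern above spans most of the Basic Multilingual Plane, so it also strips Chinese, Japanese, and Korean text. If your data contains non-Latin scripts, a narrower class may be safer. The following is a minimal sketch in plain Python 3, not a complete emoji matcher; the name `remove_emojis_narrow` and the choice of blocks are my own assumptions, so extend the list if some emoji survive.

```python
import re

# Hypothetical narrower pattern: emoji-heavy blocks plus joiners and selectors,
# leaving CJK and other non-Latin scripts untouched.
EMOJI_PATTERN = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs (includes skin tone modifiers)
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F900-\U0001F9FF"  # supplemental symbols & pictographs
    "\U0001F1E6-\U0001F1FF"  # regional indicators (flags)
    "\u2600-\u26FF"          # miscellaneous symbols
    "\u2700-\u27BF"          # dingbats
    "\u200d"                 # zero width joiner
    "\ufe0f"                 # variation selector-16
    "\u20e3"                 # combining enclosing keycap
    "]+"
)

def remove_emojis_narrow(text):
    """Remove characters in the blocks listed above; keep everything else."""
    return EMOJI_PATTERN.sub("", text)

print(remove_emojis_narrow("日本語 \U0001F44D\U0001F3FD ok"))
```

In OpenRefine's Jython editor the same body could be used with `u"..."` literals and a final `return remove_emojis_narrow(value)`, though I have only reasoned about it as standalone Python 3.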

Ettore Rizza