How to check the Emoji property of a character in Python?

Question

In Unicode, a character can have an Emoji property.

Is there a standard way in Python to determine if a character is an Emoji?

I know of unicodedata, but it doesn't appear to expose all these extra character details.

Note: I'm asking about the specific attribute called "Emoji" in the Unicode standard, as provided in the link. I don't want an arbitrary list of pattern ranges, and I would prefer to use a standard library.

Possible duplicate of [removing emojis from a string in Python](https://stackoverflow.com/questions/33404752/removing-emojis-from-a-string-in-python) — kabanus, Jul 05 '17 at 12:03
@kabanus Not a duplicate. The other question designates a random list of characters as emoji; I'm asking about the ones specifically marked as emoji by the Unicode standard. — edA-qa mort-ora-y, Jul 05 '17 at 12:09

edA-qa mort-ora-y · Accepted Answer · 2017-07-08T18:30:57.430

This is the code I ended up creating to load the Emoji information. The get_emoji function fetches the data file, parses it, and calls the enumeration callback for each code point. The rest of the code uses this to produce a JSON file of the information I needed.

#!/usr/bin/env python3
# Generates a list of emoji characters and names in JS format
import urllib.request
import unicodedata
import re, json

def get_emoji(on_emoji, attribute):
    '''
    Enumerates the Emoji characters that match an attribute from the Unicode standard (the Emoji list).

    @param on_emoji A callback that is called with each found character. Signature `on_emoji( code_point_value )`
    @param attribute The attribute that is desired, such as `Emoji` or `Emoji_Presentation`
    '''
    with urllib.request.urlopen('http://www.unicode.org/Public/emoji/5.0/emoji-data.txt') as f:
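        # Fall back to UTF-8 if the response does not declare a charset
        content = f.read().decode(f.headers.get_content_charset() or 'utf-8')

        # Matches "XXXX" or "XXXX..YYYY", followed by "; attribute # comment"
        cldr = re.compile(r'^([0-9A-F]+)(\.\.([0-9A-F]+))?([^;]*);([^#]*)#(.*)$')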
        for line in content.splitlines():
            m = cldr.match(line)
            if m is None:
                continue

            line_attribute = m.group(5).strip()
            if line_attribute != attribute:
                continue

            code_point = int(m.group(1),16)
            if m.group(3) is None:
                on_emoji(code_point)
            else:
                to_code_point = int(m.group(3),16)
                for i in range(code_point,to_code_point+1):
                    on_emoji(i)


# Dumps the values into a JSON format
def print_emoji(value):
    c = chr(value)
    try:
        obj = {
            'code': value,
            'name': unicodedata.name(c).lower(),
        }
        print(json.dumps(obj),',')
    except ValueError:
        # unicodedata.name() found no name for this character;
        # the Unicode DB is likely outdated in the installed Python
        pass

print( "module.exports = [" )
get_emoji(print_emoji, "Emoji_Presentation")
print( "]" )

That solved my original problem. To answer the question itself, it would just be a matter of sticking the results into a dictionary and doing a lookup.
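A minimal sketch of that lookup, assuming the text of emoji-data.txt has already been downloaded. The names `parse_emoji_data` and `is_emoji` are hypothetical helpers, not part of any library, and the `SAMPLE` lines are illustrative stand-ins written in the file's layout rather than copied from it. A set is used here instead of a dictionary because only membership is needed.

```python
# Illustrative lines in the emoji-data.txt layout (not copied from the real file)
SAMPLE = """\
# code point(s) ; attribute # comment
0023          ; Emoji                # number sign
1F600..1F602  ; Emoji                # three faces
1F600..1F602  ; Emoji_Presentation   # three faces
"""


def parse_emoji_data(text, attribute):
    """Return the set of code points tagged with `attribute` in the given text."""
    code_points = set()
    for line in text.splitlines():
        # Drop the trailing comment, then skip blank and comment-only lines
        data = line.split('#', 1)[0].strip()
        if not data:
            continue
        fields = [field.strip() for field in data.split(';')]
        if len(fields) < 2 or fields[1] != attribute:
            continue
        # The first field is either "XXXX" or a range "XXXX..YYYY"
        first, _, last = fields[0].partition('..')
        start = int(first, 16)
        end = int(last, 16) if last else start
        code_points.update(range(start, end + 1))
    return code_points


def is_emoji(char, emoji_set):
    """Check a single character against a previously built set."""
    return ord(char) in emoji_set


EMOJI = parse_emoji_data(SAMPLE, 'Emoji')
print(is_emoji('\U0001F600', EMOJI))
print(is_emoji('a', EMOJI))
```

In real use, the `content` string read inside get_emoji would be passed in place of `SAMPLE`, and the set would be built once and reused for every lookup.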

score -1 · Answer 2 · answered Jul 05 '17 at 12:01

I have used the following regex pattern successfully before:

import re

emoji_pattern = re.compile(
    "["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+",
    flags=re.UNICODE,
)

Also check out this question: [removing emojis from a string in Python](https://stackoverflow.com/questions/33404752/removing-emojis-from-a-string-in-python)
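A short usage sketch for the pattern above, assuming Python 3. The helper name `strip_emoji` is hypothetical. It removes characters inside the four listed ranges and leaves everything else untouched, which includes any emoji whose code point falls outside those ranges, such as U+1F914.

```python
import re

# Same four ranges as in the pattern above
emoji_pattern = re.compile(
    "["
    "\U0001F600-\U0001F64F"  # emoticons
    "\U0001F300-\U0001F5FF"  # symbols & pictographs
    "\U0001F680-\U0001F6FF"  # transport & map symbols
    "\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "]+",
    flags=re.UNICODE,
)


def strip_emoji(text):
    """Remove every run of characters that falls inside the ranges above."""
    return emoji_pattern.sub("", text)


# U+1F600 is inside the first range, so it is removed
print(repr(strip_emoji("Hi \U0001F600!")))
# U+1F914 is outside all four ranges, so it is left in place
print(repr(strip_emoji("Hmm \U0001F914")))
```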

Nick Chapman


These ranges are not the ones contained in the standard Unicode list of emoji data. – edA-qa mort-ora-y Jul 05 '17 at 12:07
@edA-qamort-ora-y Well, I would still do the same thing, but just expand it to include the entire range. – Nick Chapman Jul 05 '17 at 12:08
