I have a string which I know is a single Unicode character. I need to check if it is an emoji. My library currently has no dependencies, and I would like to keep it that way.

Someone
  • It's not a simple request. See https://stackoverflow.com/questions/30470079/emoji-value-range – Mark Ransom May 18 '22 at 22:49
  • I have some clever perl code that can do this, and some not-as-clever java, if you think that might work instead. – Shawn May 19 '22 at 16:35

1 Answer

The short answer is that you're in for a lot of work!

Years ago, when there were only a few emoji, this was a simpler problem. Unicode is arranged into "code blocks", described here:

https://unicode.org/charts/

A given code block is defined by a range. For example, "Emoticons" (note that this block is a tiny subset of modern emoji) is the range 1F600–1F64F. Thus:

0x1F600 <= ord(ch) <= 0x1F64F

will be true if ch is in the Emoticon code block.
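
As a minimal sketch, that range test can be wrapped in a function. The name `in_emoticons_block` is made up for illustration, and the check only covers this one block:

```python
def in_emoticons_block(ch: str) -> bool:
    """Return True if the single character ch lies in the Emoticons block."""
    # ord() raises TypeError unless ch is exactly one code point.
    return 0x1F600 <= ord(ch) <= 0x1F64F


print(in_emoticons_block("\U0001F600"))  # grinning face, inside the block
print(in_emoticons_block("\u263A"))      # smiling face, outside the block
```

The second call is false even though U+263A is listed in the emoji data files, which is why a single range is not enough.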

However! Today, your problem is complicated by the following:

  1. there is no single code block containing all emoji
  2. the emoji-containing code blocks are not contiguous
  3. an emoji-containing code block may not contain pure emoji
  4. emoji can be composite, built from a base character plus modifiers (like skin tone), variation selectors, or zero-width joiners
  5. some emoji sequences are equivalent to others, due to the previous point, in the same way that Unicode has more than one way to represent accented-lowercase-e, so your string may require normalization.
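
To see what a composite emoji looks like from Python, here is a small sketch that lists the code points of one sequence using only the standard library. The sample (a person raising a hand, a dark skin tone modifier, a zero-width joiner, the female sign, and a variation selector) is just one illustrative choice:

```python
import unicodedata

# One displayed emoji, written out as escapes so every code point is visible.
sample = "\U0001F64B\U0001F3FF\u200D\u2640\uFE0F"

for ch in sample:
    # unicodedata.name() raises ValueError for unnamed code points,
    # so a default is passed as the second argument.
    print(f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))

# Iterating a str yields code points, so this prints 5, not 1.
print(len(sample))
```

Any per-character test therefore sees five separate inputs for what a user perceives as a single emoji.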

These issues, and many others, are enumerated here:

https://www.unicode.org/reports/tr51/tr51-19.html

In order to manage this correctly without including dependencies, you'll end up writing your own standard-conforming code, reproducing the current set of lookups and combining-mark rules. This isn't impossible, since the data and the rules are freely available, but it is detailed work. The emoji data file index is here:

https://www.unicode.org/reports/tr51/tr51-19.html#emoji_data

The files linked through the above are fixed-format text files, which are designed to be easy to parse. For example, this file:

https://www.unicode.org/Public/emoji/14.0/emoji-test.txt

which is almost 5000 lines, is described as the best definition of the full set. Note that the first "column" of that file (to the left of the semicolon) sometimes has one code point, and sometimes has multiple:

1F617                                                  ; fully-qualified     # 😗 E1.0 kissing face
263A FE0F                                              ; fully-qualified     # ☺️ E0.6 smiling face
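
A single line in this format can be parsed roughly as follows. This is a sketch that assumes well-formed lines (a data line without a semicolon would raise `ValueError`), and `parse_emoji_test_line` is a hypothetical helper name:

```python
from typing import Optional, Tuple


def parse_emoji_test_line(line: str) -> Optional[Tuple[str, str]]:
    """Return (sequence, status) for a data line, or None for comments and blanks."""
    # Everything after the first "#" is a comment; the code point field
    # is plain hex, so it can never contain a "#" itself.
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    code_points, status = (field.strip() for field in data.split(";", 1))
    # The first field holds one or more space-separated hex code points.
    sequence = "".join(chr(int(cp, 16)) for cp in code_points.split())
    return sequence, status


print(parse_emoji_test_line("263A FE0F ; fully-qualified # E0.6 smiling face"))
print(parse_emoji_test_line("# group: Smileys & Emotion"))
```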

At the risk of giving you false hope, the following Python expression describes the set of possible Unicode emoji sequences according to that file. Searching for these is more expensive than searching for single characters, of course. You might consider constructing a trie.

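import requests  # third-party; the standard library's urllib.request.urlopen can fetch the file too

{
    "".join(chr(int(x, base=16)) for x in line.split(";")[0].split())
    for line in requests.get("https://www.unicode.org/Public/emoji/14.0/emoji-test.txt")
    .content.decode()
    .splitlines()
    # Skip blank lines as well as comments; a blank line would add "" to the set.
    if line.strip() and not line.startswith("#")
}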
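
For the trie idea, a minimal sketch using nested dictionaries might look like this. The names `build_trie`, `longest_match`, and `END` are illustrative, and the three-entry sample stands in for the full set of sequences from emoji-test.txt:

```python
from typing import Dict, Iterable, Optional

END = ""  # sentinel key; real keys are single characters, so "" cannot collide


def build_trie(sequences: Iterable[str]) -> Dict:
    """Build a nested-dict trie keyed by single code points."""
    root: Dict = {}
    for seq in sequences:
        node = root
        for ch in seq:
            node = node.setdefault(ch, {})
        node[END] = True  # mark the end of a complete sequence
    return root


def longest_match(trie: Dict, text: str, start: int = 0) -> Optional[str]:
    """Return the longest known sequence beginning at text[start], or None."""
    node = trie
    best = None
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break
        if END in node:
            best = text[start : i + 1]
    return best


# A tiny hand-written sample standing in for the full set.
trie = build_trie(["\u263A", "\u263A\uFE0F", "\U0001F617"])
print(longest_match(trie, "\u263A\uFE0F!") == "\u263A\uFE0F")
```

Taking the longest match matters because many sequences share a prefix: U+263A alone is a listed sequence, and so is U+263A followed by U+FE0F.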

Best of luck if you embark on this! Remember that you may need to rebuild or regenerate your code at intervals in order to remain conformant with the latest releases. It's unfortunately easy to write something which is close enough to seem like it's working, only to be surprised by some long-tail code point sequence.

Derek T. Jones
  • My code already splits at the boundaries of Unicode codepoints. If combining characters were separated, that would be fine. Skin tone, gender, etc modifiers should be separated before they get to the emoji-detector function – Someone May 18 '22 at 23:50
  • I'm iterating over the text with `for character in text`, and need to check if each `character` is an emoji. – Someone May 18 '22 at 23:53
  • There are also emoji sequences, like the Emoji Flag Sequence, which I'm not sure you can test on a single-character basis. https://www.unicode.org/reports/tr51/tr51-19.html#Emoji_Sequences – Derek T. Jones May 19 '22 at 00:35
  • Okay, thank you. If, for example, I have the emoji 🙋🏿‍♀️, that splits to `[b'\xf0\x9f\x99\x8b', b'\xf0\x9f\x8f\xbf', b'\xe2\x80\x8d', b'\xe2\x99\x80', b'\xef\xb8\x8f']` (as `str`s, that is `['🙋', '🏿', '\u200d', '♀', '️']`). Ideally, I want it to recognize only 🙋 as an emoji and handle the others as non-emojis. I want 🙋‍♀️🙋🏻‍♀️🙋🏼‍♀️🙋🏽‍♀️🙋🏾‍♀️🙋🏿‍♀️🙋‍♂️🙋🏻‍♂️🙋🏼‍♂️🙋🏽‍♂️🙋🏾‍♂️🙋🏿‍♂️ to all be treated as the same emoji. – Someone May 19 '22 at 01:19
  • This sounds similar to search-term normalization, so that one can search for "café" with "cafe", etc., using NFKC then dropping diacritics. I suppose you could try the same approach here, though the ZWJ problem is harder. Look for characters with `Emoji_Modifier_Base` in https://www.unicode.org/Public/13.0.0/ucd/emoji/emoji-data.txt . – Derek T. Jones May 21 '22 at 17:38
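
For the per-character check discussed in these comments, a dependency-free sketch could read property ranges out of emoji-data.txt and test each code point against them. The helper names are hypothetical, and the sample lines below are only written in the file's `XXXX..YYYY ; Property # comment` layout, so the real file should be treated as the source of truth:

```python
from typing import Iterable, List, Tuple


def load_property_ranges(lines: Iterable[str], prop: str) -> List[Tuple[int, int]]:
    """Collect inclusive (start, end) code point ranges tagged with prop."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue
        field, name = (part.strip() for part in data.split(";", 1))
        if name != prop:
            continue
        # A field is either "XXXX" or "XXXX..YYYY".
        start, _, end = field.partition("..")
        ranges.append((int(start, 16), int(end or start, 16)))
    return ranges


def has_property(ch: str, ranges: List[Tuple[int, int]]) -> bool:
    """Return True if the single character ch falls in any of the ranges."""
    cp = ord(ch)
    return any(start <= cp <= end for start, end in ranges)


# Illustrative sample lines; load the real file in practice.
SAMPLE = [
    "# comment line",
    "261D          ; Emoji_Modifier_Base  # index pointing up",
    "270A..270C    ; Emoji_Modifier_Base  # raised fist..victory hand",
    "1F3FB..1F3FF  ; Emoji_Modifier       # skin tone modifiers",
]
bases = load_property_ranges(SAMPLE, "Emoji_Modifier_Base")
print(has_property("\u270B", bases), has_property("\U0001F3FF", bases))
```

A linear scan over the ranges is fine for a sketch; with the full file, sorting the ranges and using `bisect` would be one way to speed up lookups.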