
Let's say I have the following string: DATA = "".

I want to get an array or list with each single emoji as an element, like so [,,,].

The problem, however, is that the length of an emoji varies. So len(u'') is 1, whereas len(u'') is 2.

So how would I split up my DATA? I've seen it done in JavaScript, but I couldn't figure out a way to do it in Python (How can I split a string containing emoji into an array?).
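
To make the length problem concrete, here is a minimal sketch for Python 3.3 or later, where a str is a sequence of code points. The two sample strings are assumptions chosen for illustration (a single-code-point emoji, and a thumbs-up followed by a skin tone modifier); they are not necessarily the original data.

```python
# Assumed samples, written as escapes so they survive any encoding issues.
single = "\U0001F618"               # one code point
modified = "\U0001F44D\U0001F3FE"   # base emoji + skin tone modifier: two code points

# len() counts code points, not what a reader sees as one emoji.
code_point_counts = (len(single), len(modified))

# Naive iteration separates the modifier from its base emoji.
naive_split = list(modified)
```

So a plain `list(DATA)` would likely split a modified emoji into two elements, which is the behaviour the question wants to avoid.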

Hashirun
  • Possible duplicate of [How to find and count emoticons in a string using python?](http://stackoverflow.com/questions/19149186/how-to-find-and-count-emoticons-in-a-string-using-python) – Eugene Soldatov Oct 14 '15 at 16:42
  • @EugeneSoldatov I've seen that question before as well, but it actually only shows how to count the numbers of emojis correctly. – Hashirun Oct 14 '15 at 17:07
  • Just don't use the len() function: re.findall(u'[\U0001f600-\U0001f650]', s) – Eugene Soldatov Oct 14 '15 at 17:08
  • That doesn't work though. The emoji for example is actually a combination of and . So a `re.findall` results in ['', ''] instead of ['']. – Hashirun Oct 14 '15 at 17:13
  • The 3rd party `regex` module can search using Unicode codepoint categories, so you could keep emoji and their modifiers together with the right expression. The codepoint in your example, however, is defined in Unicode 8.0 and would require Python 3.5 as well. – Mark Tolonen Oct 14 '15 at 21:09
  • @MarkTolonen I'm using Python 3.5, so that shouldn't be a problem. Could you elaborate a bit on how I can determine which code points belong to each other? If I'm looking at the code points for '', it is (as hex) `1f44d+1f3fe`, but that would be the same result for ' ', no? – Hashirun Oct 14 '15 at 22:31

2 Answers


Using the 3rd party regex module (pip install regex) and Python 3.5:

>>> import regex
>>> s = '\U0001f680\U0001f618\U0001f44d\U0001f3fe\U0001f1e6\U0001f1ee'
>>> import unicodedata as ud
>>> ud.category(s[0])
'So'
>>> ud.category(s[1])
'So'
>>> ud.category(s[2])
'So'
>>> ud.category(s[3])
'Sk'
>>> ud.category(s[4])
'So'
>>> ud.category(s[5])
'So'
>>> regex.findall(r'\p{So}\p{Sk}*',s)
['\U0001f680', '\U0001f618', '\U0001f44d\U0001f3fe', '\U0001f1e6', '\U0001f1ee']

Edit:

Each national flag is a pair of regional indicator symbols from the range U+1F1E6 to U+1F1FF. It turns out regex has a grapheme cluster matcher, \X. In this session it keeps the two halves of a flag together, but it splits the skin tone modifier from its base emoji.

>>> regex.findall(r'\X',s)
['\U0001f680', '\U0001f618', '\U0001f44d', '\U0001f3fe', '\U0001f1e6\U0001f1ee']

However, you could look for symbol modifiers OR grapheme clusters:

>>> regex.findall(r'.\p{Sk}+|\X',s)
['\U0001f680', '\U0001f618', '\U0001f44d\U0001f3fe', '\U0001f1e6\U0001f1ee']

There may be other exceptions.
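
For readers without the regex module, the same idea can be sketched with the standard library alone by spelling out the code point ranges mentioned above. This is only a sketch under assumptions: `EMOJI_UNIT` and `split_units` are hypothetical names, it needs Python 3.3 or later, and it handles only flags and skin tone modifiers (U+1F3FB to U+1F3FF), not ZWJ sequences or variation selectors.

```python
import re

# Hypothetical helper pattern (not part of the regex module):
#   1) two regional indicator symbols, i.e. a flag, or
#   2) any single code point plus any trailing skin tone modifiers.
EMOJI_UNIT = re.compile(
    "[\U0001F1E6-\U0001F1FF]{2}"
    "|.[\U0001F3FB-\U0001F3FF]*",
    re.DOTALL,
)

def split_units(text):
    """Split text into flags, modifier sequences, and single code points."""
    return EMOJI_UNIT.findall(text)

s = "\U0001f680\U0001f618\U0001f44d\U0001f3fe\U0001f1e6\U0001f1ee"
units = split_units(s)
```

On the sample string this yields four elements, matching the last `regex.findall` result above. Like \X, it also returns ordinary characters one by one.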

Mark Tolonen
  • Almost works! Not sure why, but the flag emojis are a bit different. Do you know if they are supported? The last two code points in your `s` (`\U0001f1e6\U0001f1ee`) should be one emoji, i.e. . Are flags not supported in Unicode 8.0 yet? – Hashirun Oct 15 '15 at 08:11
  • Ok, I just read that flags are a combination of two regional indicator symbols in the range `U+1F1E6` to `U+1F1FF`. So the easiest way is probably to go over the resulting list and look for that range? – Hashirun Oct 15 '15 at 08:18
  • Good answer. However, I find that \X seems to match regular ASCII and other characters (pretty much all chars), which makes the emoji detection less effective. – Xerion Jun 27 '16 at 16:23
  • @Xerion \X matches grapheme clusters, which includes single code points as well. If you want particular characters, you still need to search for code point ranges. – Mark Tolonen Jun 28 '16 at 00:05
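
A rough sketch of the range check suggested in the comment above: after splitting into clusters, keep only those whose first code point falls in a chosen range. The names `EMOJI_RANGES` and `looks_like_emoji` are hypothetical, and the ranges are illustrative only, not the complete Unicode emoji list.

```python
# Illustrative ranges only; this is NOT the full set of emoji code points.
EMOJI_RANGES = (
    (0x1F1E6, 0x1F1FF),  # regional indicator symbols (flags)
    (0x1F300, 0x1F5FF),  # Miscellaneous Symbols and Pictographs
    (0x1F600, 0x1F64F),  # Emoticons
    (0x1F680, 0x1F6FF),  # Transport and Map Symbols
)

def looks_like_emoji(cluster):
    """Return True if the cluster's first code point is in EMOJI_RANGES."""
    if not cluster:
        return False
    first = ord(cluster[0])
    return any(low <= first <= high for low, high in EMOJI_RANGES)

# Clusters as a splitter might return them, mixed with ordinary characters.
clusters = ["a", " ", "\U0001f680", "\U0001f44d\U0001f3fe", "\U0001f1e6\U0001f1ee"]
emoji_only = [c for c in clusters if looks_like_emoji(c)]
```

Checking only the first code point keeps the test cheap, at the cost of accepting any cluster that merely starts in one of the ranges.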

If you want a Python version of the JavaScript solution in How can I split a string containing emoji into an array?, then this should do the trick:

import re

pattern = re.compile(r'([\uD800-\uDBFF][\uDC00-\uDFFF])')

def emojiString2List(text):
    # Split on surrogate pairs and drop the empty strings re.split() leaves behind.
    return [x for x in pattern.split(text) if x != '']

Notice that Python's str.split() method does not accept a regex (while JS's does), therefore you have to use the re library to split using a regex. Also, by using a Python list comprehension, the code is much shorter, but the behavior should be identical. That said, I haven't fully tested the above code. At least it should get you pointed in the right direction.
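
As a hedged illustration of what the surrogate-pair pattern is looking for, the sketch below (assuming Python 3.3 or later) encodes one emoji to UTF-16 and lists its code units. The helper name `utf16_units` is hypothetical. It also shows that a Python 3 str holds the emoji as a single code point, so the pattern above finds nothing to split on there.

```python
import re

def utf16_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

rocket = "\U0001F680"
units = utf16_units(rocket)  # [high surrogate, low surrogate]

# In a Python 3.3+ str the emoji is one code point, not a surrogate pair,
# so splitting on surrogate pairs leaves the text in one piece.
pattern = re.compile(r'([\uD800-\uDBFF][\uDC00-\uDFFF])')
parts = [x for x in pattern.split("a" + rocket + "b") if x]
```

JavaScript strings expose UTF-16 code units directly, which is why the surrogate-pair approach fits there but does not carry over to Python 3 strings unchanged.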

Waylan
  • This solution only works in a narrow build of Python. If you really want to process emoji and the like, you should use at least Python 3.3. It also doesn't take care of the case of – nhahtdh Oct 15 '15 at 03:30