22

I am working in Python 2 and I have a string containing emojis as well as other unicode characters. I need to convert it to a list where each entry in the list is a single character/emoji.

x = u'😘😘xyz😊😊'
char_list = [c for c in x]

The desired output is:

['😘', '😘', 'x', 'y', 'z', '😊', '😊']

The actual output is:

[u'\ud83d', u'\ude18', u'\ud83d', u'\ude18', u'x', u'y', u'z', u'\ud83d', u'\ude0a', u'\ud83d', u'\ude0a']

How can I achieve the desired output?

jfs
Aaron
  • I have closed it as a duplicate of a superset question. Go through the answers there carefully. If it still doesn't solve your problem, please [edit] the post to include your additional attempts. – Bhargav Rao Feb 15 '16 at 08:00
  • My question differs from the other one in that I am dealing with strings that contain a mix of emojis and non-emoji characters. Also, I'm not interested in counting the emojis but in getting a list of all of the characters. – Aaron Feb 15 '16 at 17:54
  • To be clear, the list you got is correct. It's just that if you print a `list` it shows the `repr` of the contents, not the `str` form; you need to print the individual entries manually to see the `str` form (that would look like emoji). For example, if you do `print(u', '.join(char_list))` you'll see what you expect without leading or trailing brackets. – ShadowRanger Feb 15 '16 at 17:54
  • 1
    The string input has 7 characters, counting emojis as single characters. The output I get has 11 entries in the list. I need to get an output list with 7 entries corresponding to the characters in the input string. – Aaron Feb 15 '16 at 17:58
  • Which version of Python is it? In Python 2, `x = '😘😘xyz😊😊'` is illegal (or would probably be misinterpreted). – ivan_pozdeev Feb 17 '16 at 15:36
  • 2
    A duplicate of http://stackoverflow.com/questions/12907022/python-getting-correct-string-length-when-it-contains-surrogate-pairs – ivan_pozdeev Feb 17 '16 at 15:38
  • @ivan_pozdeev I don't think the answers from that question answer this question. – Uyghur Lives Matter Feb 17 '16 at 17:23
  • 1
    @ivan_pozdeev: it must be Python 2, since the actual output is using `u'...'` string literals to represent the values. Which then does highlight that this question is missing an actual [mcve]. Either `from __future__ import unicode_literals` is missing, or the `u` prefix on the `x` string definition. – Martijn Pieters Feb 17 '16 at 18:08
  • @cpburnz: it answers the actual problem that OP has. In general, emoji may span several Unicode codepoints (`len(emoji) > 1` whatever Python build) e.g., [🇫🇷 (U+1f1eb U+1f1f7)](https://medium.com/@mroth/how-i-built-emojitracker-179cfd8238ac). What is emoji is different in different contexts. The question in the title *"Correctly extract Emojis from a Unicode string"* is too complex (too broad). Fixing OP's problem doesn't answer the question (as [the currently accepted answer](http://stackoverflow.com/a/35462951/4279) demonstrates). – jfs Feb 19 '16 at 16:11
  • @Aaron: the question title should be changed to reflect your actual narrower problem that the accepted answer covers. Otherwise, visitors from google that come here expecting the answer to the broader question from the title might be disappointed. – jfs Feb 19 '16 at 16:21

2 Answers

17

First of all, in Python 2 you need to use Unicode string literals (u'<...>') for Unicode characters to be treated as Unicode characters. You also need to declare the correct source encoding if you want to use the characters themselves rather than the \UXXXXXXXX representation in source code.

Now, as per "Python: getting correct string length when it contains surrogate pairs" and "Python returns length of 2 for single Unicode character string", in Python 2 "narrow" builds (with sys.maxunicode==65535), characters outside the Basic Multilingual Plane (code points above U+FFFF) are represented as surrogate pairs, and this is not transparent to string functions. This has only been fixed in 3.3 (PEP 393).
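
A quick way to see which kind of build is running, as a minimal sketch: the helper name `is_narrow_build` and the variable `kiss` are made up for this illustration, and the only interpreter fact relied on is the `sys.maxunicode` value quoted above.

```python
import sys


def is_narrow_build():
    """Hypothetical helper: True when sys.maxunicode is 0xFFFF (65535)."""
    return sys.maxunicode == 0xFFFF


# One character above U+FFFF, written with the \U escape so the source
# file needs no special encoding declaration.
kiss = u'\U0001F618'

# On a narrow build the character is stored as two code units, so len()
# reports 2; on a wide build, or on Python 3.3+, it reports 1.
expected_len = 2 if is_narrow_build() else 1
print(len(kiss) == expected_len)
```

If `is_narrow_build()` returns True, iterating over a string yields the two halves of each pair separately, which is the behaviour shown in the question.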

The simplest resolution (save for migrating to 3.3+) is to compile a Python "wide" build from source as outlined on the 3rd link. In it, Unicode characters are all 4-byte (thus are a potential memory hog) but if you need to routinely handle wide Unicode chars, this is probably an acceptable price.

The solution for a "narrow" build is to make a custom set of string functions (len, slice; maybe as a subclass of unicode) that would detect surrogate pairs and handle them as a single character. I couldn't readily find an existing one (which is strange), but it's not too hard to write:

  • as per UTF-16#U+10000 to U+10FFFF - Wikipedia,
    • the 1st character (high surrogate) is in range 0xD800..0xDBFF
    • the 2nd character (low surrogate) - in range 0xDC00..0xDFFF
    • these ranges are reserved and thus cannot occur as regular characters
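
To make the two ranges concrete, here is a small sketch of the standard UTF-16 arithmetic that maps a code point in U+10000..U+10FFFF onto a pair and back. The function names `to_surrogate_pair` and `from_surrogate_pair` are invented for this example and are not part of any library.

```python
def to_surrogate_pair(code_point):
    """Hypothetical helper: split a code point into (high, low) code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % code_point)
    offset = code_point - 0x10000        # a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits land in 0xD800..0xDBFF
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits land in 0xDC00..0xDFFF
    return high, low


def from_surrogate_pair(high, low):
    """Hypothetical helper: inverse of to_surrogate_pair."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


# U+1F618 is the first emoji in the question.
print([hex(unit) for unit in to_surrogate_pair(0x1F618)])  # ['0xd83d', '0xde18']
```

The result matches the `u'\ud83d', u'\ude18'` entries in the output shown in the question.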

So here's the code to detect a surrogate pair:

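def is_surrogate(s,i):
    """Return True if a surrogate pair starts at index i of s."""
    if 0xD800 <= ord(s[i]) <= 0xDBFF:       # high surrogate
        try:
            low = s[i+1]
        except IndexError:                  # the string ends after the high surrogate
            return False
        if 0xDC00 <= ord(low) <= 0xDFFF:    # followed by a low surrogate
            return True
        else:
            raise ValueError("Illegal UTF-16 sequence: %r" % s[i:i+2])
    else:
        return False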

And a function that returns a simple slice:

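def slice(s,start,end):
    """Return s[start:end], counting each surrogate pair as one character."""
    l=len(s)
    i=0
    # Walk up to the start position, shifting both bounds past each pair.
    while i<start and i<l:
        if is_surrogate(s,i):
            start+=1
            end+=1
            i+=1
        i+=1
    # Walk on to the end position, shifting only the end bound.
    while i<end and i<l:
        if is_surrogate(s,i):
            end+=1
            i+=1
        i+=1
    return s[start:end]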

Here, the price you pay is performance, as these functions are much slower than built-ins:

>>> import timeit
>>> ux=u"a"*5000+u"\U00100000"*30000+u"b"*50000
>>> timeit.timeit('slice(ux,10000,100000)','from __main__ import slice,ux',number=1000)
46.44128203392029    # seconds for 1000 calls, i.e. about 46 msec per call
>>> timeit.timeit('ux[10000:100000]','from __main__ import slice,ux',number=1000000)
8.814016103744507    # seconds for 1000000 calls, i.e. about 8.8 usec per call
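
For the list the question asks for, the same range check can be applied while walking the string once. This is only a sketch: `split_chars` is a made-up name, and unlike a strict validator it keeps an unpaired surrogate as its own entry instead of raising an error. The input is written with explicit `\u` escapes so that the code units match the output shown in the question.

```python
def split_chars(s):
    """Hypothetical helper: list of characters, surrogate pairs kept together."""
    chars = []
    i = 0
    while i < len(s):
        unit = ord(s[i])
        # A pair is a high surrogate immediately followed by a low surrogate.
        is_pair = (0xD800 <= unit <= 0xDBFF
                   and i + 1 < len(s)
                   and 0xDC00 <= ord(s[i + 1]) <= 0xDFFF)
        width = 2 if is_pair else 1
        chars.append(s[i:i + width])
        i += width
    return chars


# The string from the question, spelled out as UTF-16 code units.
x = u'\ud83d\ude18\ud83d\ude18xyz\ud83d\ude0a\ud83d\ude0a'
print(len(split_chars(x)))  # 7
```

As the comments on this answer point out, this groups code units into code points only; emoji built from several code points (flags, skin-tone variants) would still be split.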
Community
ivan_pozdeev
  • 2
    Note that with all the recent fancy additions to emoji this is slightly broken, as some emoji consist of multiple code points. Examples include flags (`""`) and ethnic variants (`""` vs `""`), and some other things like combining diacritics `"à"`. – roeland Feb 18 '16 at 02:48
  • @roeland then `is_surrogate` needs to be upgraded to detect these as well and return the number of additional words(=2-byte chars) rather than True/False. That's provided we're interested in such cases (control characters and diacritics are a completely different matter if you ask me) and other facilities like normalization can't do the task. – ivan_pozdeev Feb 18 '16 at 05:18
  • 2
    I don't think normalization will handle those emoticons. The strictly correct answer would iterate over grapheme clusters, long and arcane explanation in [Unicode® Standard Annex #29](http://www.unicode.org/reports/tr29/#Grapheme_Cluster_Boundary_Rules). But without a library which can handle that I'd probably stick to iterating over code points. – roeland Feb 18 '16 at 06:07
  • @roeland: even `\X` regex won't help in the general case e.g., some (chat) software shows `:)` (U+003a U+0029) as a smiley face (a picture) i.e., it is an emoji in the given context. – jfs Feb 19 '16 at 16:15
  • @J.F.Sebastian Oh yes. Once upon a time we typed a colon and a bracket. The really old-school people would type a dash as well :-) . But I think the OP is asking about the Unicode emoji characters. – roeland Feb 21 '16 at 20:52
  • @roeland :) works on the **current** version of Skype for iPhone. It is displayed as a smiley face (image) — it is the literal definition for `emoji`: *"a small digital image or icon used to express an idea or emotion in electronic communication"*. `\X` is not enough in the general case. [The title of the question is too broad.](http://stackoverflow.com/questions/35404144/correctly-extract-emojis-from-a-unicode-string#comment58710437_35404144) – jfs Feb 21 '16 at 21:18
10

I would use the uniseg library (pip install uniseg):

# -*- coding: utf-8 -*-
from uniseg import graphemecluster as gc

print list(gc.grapheme_clusters(u'😘😘xyz😊😊'))

outputs [u'\U0001f618', u'\U0001f618', u'x', u'y', u'z', u'\U0001f60a', u'\U0001f60a'], and

[x.encode('utf-8') for x in gc.grapheme_clusters(u'😘😘xyz😊😊')]

will provide the list of characters as UTF-8 encoded strings.

James Hopkin