4

I would like to iterate over a string and output all emojis.

I'm trying to iterate over the characters, and check them against an emoji list.

However, Python seems to split the Unicode characters into smaller ones, breaking my code. Example:

>>> list(u'Test \U0001f60d')
[u'T', u'e', u's', u't', u' ', u'\ud83d', u'\ude0d']

Any ideas why u'\U0001f60d' gets split?

Or what's a better way to extract all emojis? This was my original extraction code:

def get_emojis(text):
  emojis = []
  for character in text:
    if character in EMOJI_SET:
      emojis.append(character)
  return emojis
Vinicius Fortuna
  • 4
    I cannot reproduce it on Python 2.7 nor on Python 2.6 (and I don't have older versions at hand). When I look at `list(u'Test \U0001f60d')` I get `[u'T', u'e', u's', u't', u' ', u'\U0001f60d']`. What version of Python are you using? – Alfe Oct 12 '17 at 14:24
  • This is how wide Unicode characters are [internally represented on narrow builds](https://stackoverflow.com/questions/29109944/python-returns-length-of-2-for-single-unicode-character-string). This should be fixed in Python 3.3+, where the internal representation [was changed](https://www.python.org/dev/peps/pep-0393/) – mata Oct 12 '17 at 14:38
  • Also, you can flip your loop and iterate over emojis instead of original string – Yaroslav Surzhikov Oct 12 '17 at 14:45

3 Answers

8

Python before 3.3 stores Unicode internally as UTF-16 (narrow build) or UTF-32 (wide build) and, through a leaky abstraction, exposes this detail to the user. UTF-16 represents code points above U+FFFF as a surrogate pair of two 16-bit code units, and a narrow build shows each unit as a separate character. To fix the issue, either use a wide Python build or switch to Python 3.3 or later.
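To make the surrogate-pair arithmetic concrete, here is a minimal sketch (written for Python 3; the helper name `to_surrogate_pair` is made up for illustration) that computes the two UTF-16 code units for a code point above U+FFFF. For U+1F60D it gives the same `\ud83d` and `\ude0d` that appear in the question.

```python
def to_surrogate_pair(code_point):
    """Return the (high, low) UTF-16 surrogates for a code point above U+FFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in a supplementary plane")
    offset = code_point - 0x10000      # 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

high, low = to_surrogate_pair(0x1F60D)
print(hex(high), hex(low))  # 0xd83d 0xde0d
```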

One way of dealing with a narrow build is to match the surrogate pairs:

Python 2.7 (narrow build):

>>> import re
>>> s = u'Test \U0001f60d'
>>> len(s)
7
>>> re.findall(u'(?:[\ud800-\udbff][\udc00-\udfff])|.',s)
[u'T', u'e', u's', u't', u' ', u'\U0001f60d']

Python 3.6:

>>> s = 'Test \U0001f60d'
>>> len(s)
6
>>> list(s)
['T', 'e', 's', 't', ' ', '😍']
Mark Tolonen
  • I don't know why but I don't think it will work for all unicode. Try `Test ` – Tom Wojcik Jan 05 '22 at 13:40
  • 1
    @TomWojcik Unicode strings are made up of Unicode code points, but some code points combine with others to make graphemes (single visual characters). Flags are made of two code points, for example. – Mark Tolonen Jan 05 '22 at 16:19
  • TIL, thanks. So it splits correctly to (sometimes multiple) unicode representations, but I assume OP needed graphemes (emojis as expected by the end user). – Tom Wojcik Jan 05 '22 at 21:43
  • 1
    @TomWojcik Graphemes are a complicated topic, but IIRC the 3rd party `regex` library has a `\X` that can be used. More than emojis can have multiple code points. Unicode just gets more complicated. This post is 5 years old. – Mark Tolonen Jan 06 '22 at 00:04
1

I've been fighting with Unicode myself, and it's not as easy as it seems. There's this emoji library that handles all the caveats (I'm not affiliated).

If you want to list all emojis that appear in the string, I'd recommend emoji.emoji_lis.

Just look into the source of emoji.emoji_lis to understand how complicated it actually is.

Example

>>> emoji.emoji_lis('')
[{'location': 0, 'emoji': ''}, {'location': 1, 'emoji': ''}, {'location': 2, 'emoji': ''}]

Example with list (won't always work)

>>> list('')
['', '', '', '']
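
To illustrate why `list()` won't always work, here is a small sketch using only the standard library (Python 3). The flag is my own choice of example, written with escapes so it survives copy and paste. A flag is displayed as a single emoji but is built from two regional-indicator code points, so iterating over the string splits it.

```python
import unicodedata

# Regional indicators U + S, rendered as one flag by most fonts
flag = '\U0001F1FA\U0001F1F8'

parts = list(flag)
print(len(parts))  # 2
for ch in parts:
    # Show each code point and its official Unicode name
    print(hex(ord(ch)), unicodedata.name(ch))
```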
Tom Wojcik
0

Try this:

import re
re.findall(r'[^\w\s,]', my_list[0])

The regex r'[^\w\s,]' matches any single character that is not a word character, whitespace, or a comma.
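
A quick check of what this pattern picks up (Python 3; the sample string is made up for illustration). Because the character class excludes only word characters, whitespace, and commas, other punctuation comes back alongside the emoji, so the result may need further filtering.

```python
import re

sample = 'Hi, there! \U0001f60d'

# Everything that is not a word character, whitespace, or a comma
matches = re.findall(r'[^\w\s,]', sample)
print(matches)  # ['!', '😍']
```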

Melissa Stewart