How to properly iterate over unicode characters in Python

Question

I would like to iterate over a string and output all emojis.

I'm trying to iterate over the characters, and check them against an emoji list.

However, Python seems to split some Unicode characters into smaller pieces, which breaks my code. Example:

>>> list(u'Test \U0001f60d')
[u'T', u'e', u's', u't', u' ', u'\ud83d', u'\ude0d']

Any ideas why u'\U0001f60d' gets split?

Or what's a better way to extract all emojis? This was my original extraction code:

def get_emojis(text):
  emojis = []
  for character in text:
    if character in EMOJI_SET:
      emojis.append(character)
  return emojis

I cannot reproduce it on Python 2.7 nor on Python 2.6 (and I don't have older versions at hand). When I look at `list(u'Test \U0001f60d')` I get `[u'T', u'e', u's', u't', u' ', u'\U0001f60d']`. What version of Python are you using? – Alfe, Oct 12 '17 at 14:24
This is how wide Unicode characters are [internally represented on narrow builds](https://stackoverflow.com/questions/29109944/python-returns-length-of-2-for-single-unicode-character-string). This should be fixed in Python 3.3+, where the internal representation [was changed](https://www.python.org/dev/peps/pep-0393/) – mata, Oct 12 '17 at 14:38
Also, you can flip your loop and iterate over the emojis instead of the original string – Yaroslav Surzhikov, Oct 12 '17 at 14:45

Mark Tolonen · Accepted Answer · 2017-10-12T16:48:08.147

8

Python versions before 3.3 store Unicode internally as UTF-16LE (narrow build) or UTF-32LE (wide build) and, through a leaky abstraction, expose this detail to the user. UTF-16LE represents Unicode characters above U+FFFF as surrogate pairs, that is, two 16-bit code units, and a narrow build reports each unit as a separate character. Either use a wide Python build or switch to Python 3.3 or later to fix the issue.
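To make the surrogate-pair arithmetic concrete, here is a minimal sketch written for Python 3. The helper names `to_surrogate_pair` and `from_surrogate_pair` are made up for illustration and are not part of any library; the formula itself is the standard UTF-16 one.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need a surrogate pair")
    offset = code_point - 0x10000      # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low


def from_surrogate_pair(high, low):
    """Recombine a (high, low) surrogate pair into the original code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogate_pair(0x1F60D)
print(hex(high), hex(low))                  # 0xd83d 0xde0d
print(hex(from_surrogate_pair(high, low)))  # 0x1f60d
```

The two values printed first are the same `\ud83d` and `\ude0d` that the narrow build shows in the question.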

One way of dealing with a narrow build is to match the surrogate pairs:

Python 2.7 (narrow build):

>>> import re
>>> s = u'Test \U0001f60d'
>>> len(s)
7
>>> re.findall(u'(?:[\ud800-\udbff][\udc00-\udfff])|.',s)
[u'T', u'e', u's', u't', u' ', u'\U0001f60d']

Python 3.6:

>>> s = 'Test \U0001f60d'
>>> len(s)
6
>>> list(s)
['T', 'e', 's', 't', ' ', '😍']
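On Python 3.3 or later, the extraction loop from the question should then work as written. A minimal sketch, assuming a tiny stand-in `EMOJI_SET` (the question does not show its real contents):

```python
# Hypothetical stand-in for the question's EMOJI_SET; a real set would be much larger.
EMOJI_SET = {"\U0001f60d", "\U0001f600"}


def get_emojis(text):
    """Return every character of text that is in EMOJI_SET, in order."""
    # On Python 3.3+ iterating a str yields whole code points,
    # so U+1F60D arrives as one item rather than as two surrogates.
    return [character for character in text if character in EMOJI_SET]


print(get_emojis("Test \U0001f60d"))
```

This only covers emojis that are a single code point; sequences built from several code points would still be split.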

edited Oct 12 '17 at 16:48

answered Oct 12 '17 at 16:41


I don't know why but I don't think it will work for all unicode. Try `Test ` – Tom Wojcik Jan 05 '22 at 13:40

@TomWojcik Unicode strings are made up of Unicode code points, but some code points combine with others to make graphemes (single visual characters). Flags are made of two code points, for example. – Mark Tolonen Jan 05 '22 at 16:19
TIL, thanks. So it splits correctly to (sometimes multiple) unicode representations, but I assume OP needed graphemes (emojis as expected by the end user). – Tom Wojcik Jan 05 '22 at 21:43

@TomWojcik Graphemes are a complicated topic, but IIRC the 3rd party `regex` library has a `\X` that can be used to match them. More than emojis can have multiple code points. Unicode just gets more complicated. This post is 5 years old. – Mark Tolonen Jan 06 '22 at 00:04
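The code point versus grapheme distinction from these comments can be seen with the standard library alone. The sketch below is only an illustration on Python 3; the US flag and the accented "e" are sample inputs chosen here, not taken from the thread.

```python
import unicodedata

flag = "\U0001F1FA\U0001F1F8"   # two regional indicators, drawn as one flag
accented = "e\u0301"            # 'e' followed by COMBINING ACUTE ACCENT

for label, text in (("flag", flag), ("accented e", accented)):
    names = [unicodedata.name(ch) for ch in text]
    print(label, len(text), names)

# NFC normalization merges the accent into a single code point,
# but no single code point exists for a flag, so it stays at two.
print(len(unicodedata.normalize("NFC", accented)))  # 1
print(len(unicodedata.normalize("NFC", flag)))      # 2
```

Normalization therefore does not solve the general problem; grouping code points into graphemes needs segmentation rules such as those a grapheme-aware library applies.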

score 1 · Answer 2 · answered Jan 05 '22 at 13:50

I've been struggling with Unicode myself, and it's not as easy as it seems. There's an emoji library that handles all the caveats (I'm not affiliated).

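If you want to list all emojis that appear in the string, I'd recommend emoji.emoji_lis (newer releases of the library appear to expose this as emoji.emoji_list, so check the version you have installed).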

Just look into the source of emoji.emoji_lis to understand how complicated it actually is.

Example

>>> emoji.emoji_lis('')
[{'location': 0, 'emoji': ''}, {'location': 1, 'emoji': ''}, {'location': 2, 'emoji': ''}]

Example with list (won't always work)

>>> list('')
['', '', '', '']

Melissa Stewart · Answer 3 · 2017-10-12T15:14:12.817

0

Try this,

import re
re.findall(r'[^\w\s,]', my_list[0])

The regex r'[^\w\s,]' matches any single character that is not a word character, whitespace, or a comma.
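A quick check on Python 3 shows what this pattern picks up in practice. The sample string is made up for illustration:

```python
import re

sample = "Test \U0001f60d, ok!"   # made-up input with one emoji and some punctuation

# Everything that is not a word character, whitespace, or a comma is returned,
# so the emoji is matched, but so is the exclamation mark.
matches = re.findall(r"[^\w\s,]", sample)
print(matches)
```

Because the pattern is defined by exclusion, other punctuation and symbols come back along with emojis, so the result would still need filtering against an emoji list.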

edited Oct 12 '17 at 15:14

answered Oct 12 '17 at 14:19


That still splits the emoji into two characters: `>>> re.findall(r'[^\w\s,]', u'Test \U0001f60d')` `[u'\ud83d', u'\ude0d']` – Vinicius Fortuna Oct 14 '17 at 17:56
