9

I have a Python 2.7 program which reads iOS text messages from a SQLite database. The text messages are unicode strings. In the following text message:

u'that\u2019s \U0001f63b'

The apostrophe is represented by \u2019, but the emoji is represented by \U0001f63b. I looked up the code point for the emoji in question, and it's \uf63b. I'm not sure where the 0001 is coming from. I know comically little about character encodings.

When I print the text, character by character, using:

s = u'that\u2019s \U0001f63b'

for c in s:
    print c.encode('unicode_escape')

The program produces the following output:

t
h
a
t
\u2019
s

\ud83d
\ude3b

How can I correctly read these last characters in Python? Am I using encode correctly here? Should I just attempt to trash those 0001s before reading it, or is there an easier, less silly way?

Andrew LaPrise
  • `0xf63b` is in the "Private Use" section of Unicode. Are you sure this is correct? Your codepoint is probably `0x1f63b`, as that's a "smiling cat with heart eyes" emoji. – Alyssa Haroldsen Jul 07 '15 at 22:30
  • How did you determine that `\uf63b` would be an Emoji character? According to my reference, it's undefined: http://www.fileformat.info/info/unicode/char/f63b/index.htm – Mark Ransom Jul 07 '15 at 22:30

2 Answers

19

I don't think you're using encode correctly, nor do you need to. What you have is a valid unicode string with one 4-digit and one 8-digit escape sequence. Try this in the REPL on, say, OS X:

>>> s = u'that\u2019s \U0001f63b'
>>> print s
that’s 😻

In Python 3, though:

Python 3.4.3 (default, Jul  7 2015, 15:40:07) 
>>> s  = u'that\u2019s \U0001f63b'
>>> s[-1]
'😻'
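
As an illustration of the two escape forms (a sketch added for clarity, not part of the original answer): `\uXXXX` takes exactly four hex digits and `\UXXXXXXXX` exactly eight, so `\U0001f63b` names the single code point U+1F63B. Encoding to UTF-32 spends 4 bytes on every code point, which should give the same count whether or not the interpreter stores the string as surrogate pairs. The variable names here are made up for the example.

```python
import struct

s = u'that\u2019s \U0001f63b'

# UTF-32 uses exactly 4 bytes per code point, so this count does not
# depend on how the interpreter stores the string internally.
utf32 = s.encode('utf-32-le')
code_point_count = len(utf32) // 4

# Unpack the final 4 bytes as a little-endian unsigned integer to get
# the numeric value of the last code point.
last_code_point = struct.unpack('<I', utf32[-4:])[0]

print(code_point_count)
print(hex(last_code_point))
```

The leading `0001` in the escape is therefore just the upper digits of the code point, not something to strip.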
pvg
  • Well would ya look at that... I really know nothing about nothing. Thanks! I'm still not clear how to read just that last character though. s[-1] and s[-2] still give '\ud83d' and '\ude3b'. Is there a way to read the string character by character? – Andrew LaPrise Jul 07 '15 at 22:28
  • @alaprise you're seeing an artifact of the way Python stores its Unicode strings internally. If you did the same thing in Python 3 you'd see something different entirely. – Mark Ransom Jul 07 '15 at 22:34
  • @alaprise The other answer has some good info, of which the summary is 'if possible move to Python3'. Otherwise you're entering a world of pain/surrogate pairs/words you don't want to know for they are the song of Cthulhu – pvg Jul 07 '15 at 22:43
  • '\ud83d' and '\ude3b' is a surrogate pair, used by UTF-16 to represent a code point above `U+FFFF`. This is a bug in Python 2, a lot of languages have that problem with those characters. – roeland Jul 07 '15 at 23:35
  • @roeland: `s[-1] == u'\U0001f63b'` on both Python 2 and 3 on my machine (["wide Python builds" are supported since 2001](https://www.python.org/dev/peps/pep-0261/)) – jfs Jul 11 '15 at 17:12
  • @alaprise: see [How to install python on Mac with wide-build](http://stackoverflow.com/q/25111521/4279) – jfs Jul 11 '15 at 17:12
  • I can't get this working with the warning sign, u'\U000026A0': it comes out as a text glyph, not an emoji. – Jeef Sep 12 '16 at 14:57
3

Your remaining confusion is most likely because you are running what is called a "narrow Python build". In a narrow build, each element of a unicode string is a 16-bit code unit, so a single element cannot hold a code point above U+FFFF such as this emoji; it is stored as a UTF-16 surrogate pair instead. The best solution would be to move to Python 3. Otherwise, you will need to process the UTF-16 surrogate pairs yourself.
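
If moving to Python 3 is not an option, processing the surrogate pair by hand could look roughly like the sketch below. It is only an illustration: `iter_code_points` is a hypothetical helper name, not a standard library function. It walks the string element by element and merges a high surrogate (0xD800 to 0xDBFF) followed by a low surrogate (0xDC00 to 0xDFFF) into one code point using the standard UTF-16 formula. It is written to run on both Python 2.7 and Python 3.

```python
def iter_code_points(text):
    """Yield integer code points, merging UTF-16 surrogate pairs."""
    pending_high = None
    for unit in text:
        value = ord(unit)
        if pending_high is not None:
            if 0xDC00 <= value <= 0xDFFF:
                # High + low surrogate: combine into one code point.
                yield 0x10000 + ((pending_high - 0xD800) << 10) + (value - 0xDC00)
                pending_high = None
                continue
            # Unpaired high surrogate: pass it through unchanged.
            yield pending_high
            pending_high = None
        if 0xD800 <= value <= 0xDBFF:
            # Hold a high surrogate until the next element is seen.
            pending_high = value
        else:
            yield value
    if pending_high is not None:
        # String ended on an unpaired high surrogate.
        yield pending_high


# The surrogate pair from the question, written out explicitly.
code_points = list(iter_code_points(u'that\u2019s \ud83d\ude3b'))
print([hex(cp) for cp in code_points])
```

Each yielded integer can then be treated as one character for counting or lookup, regardless of how many elements it occupied in the original string.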

Alyssa Haroldsen
  • `regex.findall(r'\X', unicode_text)` could be used to get "user-perceived characters" that may span more than one Unicode codepoint (it is unrelated to surrogate pairs but it should fix the issue as a side effect). – jfs Jul 11 '15 at 17:15