I have a Python 2.7 program that reads iOS text messages from a SQLite database. The text messages are Unicode strings. Take the following text message:
u'that\u2019s \U0001f63b'
The apostrophe is represented by \u2019, but the emoji is represented by \U0001f63b. I looked up the code point for the emoji in question, and it's \uf63b. I'm not sure where the 0001 is coming from. I know comically little about character encodings.
When I print the text, character by character, using:
s = u'that\u2019s \U0001f63b'
for c in s:
    print c.encode('unicode_escape')
The program produces the following output:
t
h
a
t
\u2019
s
\ud83d
\ude3b
How can I correctly read these last characters in Python? Am I using encode correctly here? Should I just attempt to trash those 0001s before reading it, or is there an easier, less silly way?
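For what it's worth, the last two escapes in the output look like a UTF-16 surrogate pair, which is what I understand a "narrow" Python 2 build stores for a code point above U+FFFF. Below is a minimal sketch, assuming that is what is happening here, of how the two units would recombine into a single code point. The helper names combine_surrogates and iter_code_points are made up for illustration and are not from any library.

```python
# Sketch only: combine_surrogates and iter_code_points are hypothetical helpers,
# not part of the standard library.

def combine_surrogates(high, low):
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def iter_code_points(text):
    """Yield integer code points, joining surrogate pairs where they occur."""
    i = 0
    while i < len(text):
        unit = ord(text[i])
        # A high surrogate followed by a low surrogate encodes one character.
        if 0xD800 <= unit <= 0xDBFF and i + 1 < len(text):
            nxt = ord(text[i + 1])
            if 0xDC00 <= nxt <= 0xDFFF:
                yield combine_surrogates(unit, nxt)
                i += 2
                continue
        yield unit
        i += 1


print(hex(combine_surrogates(0xD83D, 0xDE3B)))  # 0x1f63b
```

If this reading is right, the 0001 would be part of the code point itself (U+1F63B rather than U+F63B), so presumably nothing needs to be stripped.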