Odd behavior while looping through a Unicode string

Question

When I do:

text = u"奥巴马讲话"
for c in text:
    print c

I get the expected result:

奥
巴
马
讲
话

But if I do:

text = u"𤭢"
for c in text:
    print c

I get:

�
�

I expected:

𤭢

Why is this? I think it has something to do with the following fact:

In [1]: u"𤭢".encode("utf8")
Out[1]: '\xf0\xa4\xad\xa2'

"𤭢" is encoded using 4 bytes.

How can I loop through a Unicode string that contains characters encoded like this, such as u"𤭢"?

score 3 · Accepted Answer · edited May 23 '17 at 12:30

𤭢 is outside the Basic Multilingual Plane; it has code point U+24B62. This means that to process it correctly you need a Python build where sys.maxunicode == 1114111 (a "wide" build). See "Unicode in Python - just UTF-16?" for more details.
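
A quick way to check which kind of build you are running, as a minimal sketch (the helper name is_wide_build is made up for illustration and is not part of any library):

```python
import sys

def is_wide_build():
    """Return True if this interpreter can index every code point directly."""
    # 0x10FFFF == 1114111 is the highest Unicode code point.
    # Narrow Python 2 builds report 0xFFFF (65535) instead.
    return sys.maxunicode == 0x10FFFF

print(is_wide_build())
```

On a narrow build this prints False, and a character such as U+24B62 is stored as two surrogate code units, which is why the loop yields two items.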

If you can, upgrade to Python 3.3 or later, where this is all handled correctly. Otherwise you will need to implement UTF-16 handling yourself by pairing up high and low surrogate code points; see "How to iterate over Unicode characters in Python 3?".
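
Pairing up surrogates by hand could look roughly like the sketch below. The function name iter_code_points is hypothetical, and the code assumes the input is a unicode object whose astral characters are stored as a high surrogate (U+D800 to U+DBFF) followed by a low surrogate (U+DC00 to U+DFFF), as on a narrow build:

```python
def iter_code_points(text):
    """Yield one string per code point, joining UTF-16 surrogate pairs."""
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        # A high (lead) surrogate followed by a low (trail) surrogate
        # together encode a single code point above U+FFFF.
        if (u"\ud800" <= ch <= u"\udbff" and i + 1 < n
                and u"\udc00" <= text[i + 1] <= u"\udfff"):
            yield text[i:i + 2]  # keep the pair together as one item
            i += 2
        else:
            yield ch  # BMP character, or an unpaired surrogate passed through
            i += 1

# U+24B62 counts as one item on both narrow and wide builds.
print(len(list(iter_code_points(u"a\U00024b62b"))))  # prints 3
```

Looping with for c in iter_code_points(text) instead of for c in text then gives one item per character. Unpaired surrogates are yielded unchanged rather than rejected, which is a design choice of this sketch.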
