When I do:

text = u"奥巴马讲话"
for c in text:
    print c

I get the expected result:

奥
巴
马
讲
话

But if I do:

text = u"𤭢"
for c in text:
    print c

I get:

�
�

I'm expecting to get:

𤭢

Why is this? I think it has something to do with the following fact:

In [1]: u"𤭢".encode("utf8")
Out[1]: '\xf0\xa4\xad\xa2'

"𤭢" is encoded using 4 bytes in UTF-8.

How can I loop through a Unicode string that contains this kind of character?

Something like u"𤭢".

lessthanl0l

1 Answer

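𤭢 is outside the Basic Multilingual Plane; it has codepoint U+24B62. To process it correctly as a single character, you need a Python build that has sys.maxunicode == 1114111 (a "wide" build). On a "narrow" build, where sys.maxunicode == 65535, the character is stored as a UTF-16 surrogate pair, so iterating over the string yields two halves that are not printable on their own. See "Unicode in Python - just UTF-16?" for more details.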
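To see which kind of build you are running, you can compare sys.maxunicode against the two possible values. This is only a quick check, and the variable name is_wide is just for illustration:

```python
import sys

# 0x10FFFF (1114111): wide build, or any Python 3.3+.
# 0xFFFF (65535): narrow build that stores UTF-16 code units.
is_wide = sys.maxunicode == 0x10FFFF
print("wide" if is_wide else "narrow")
```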

If you can, upgrade to Python 3.3 or later, where strings are always indexed by codepoint and this is all handled correctly. Otherwise you will need to implement UTF-16 handling yourself by pairing up high and low surrogate codepoints: see "How to iterate over Unicode characters in Python 3?"
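A minimal sketch of that pairing, assuming the string holds well-formed UTF-16 on a narrow build. The function name iter_code_points is my own, not something from the standard library; lone surrogates are passed through unchanged:

```python
def iter_code_points(text):
    """Yield one string per character, joining UTF-16 surrogate pairs."""
    i = 0
    n = len(text)
    while i < n:
        ch = text[i]
        # A high surrogate (D800-DBFF) followed by a low surrogate
        # (DC00-DFFF) together encode a single character.
        if (0xD800 <= ord(ch) <= 0xDBFF and i + 1 < n
                and 0xDC00 <= ord(text[i + 1]) <= 0xDFFF):
            yield text[i:i + 2]
            i += 2
        else:
            yield ch
            i += 1
```

On a narrow build, `for c in iter_code_points(u"𤭢"): print c` then goes through the loop once with the whole character. On a wide build or Python 3.3+ the function is harmless, since no surrogates appear and each character is yielded as-is.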

ecatmur