
If I do:

print "\xE2\x82\xAC"
print len("€")
print len(u"€")

I get:

€
3
1

But if I do:

print '\xf0\xa4\xad\xa2'
print len("𤭢")
print len(u"𤭢")

I get:

𤭢
4
2

In the second example, len() returned 2 instead of 1 for the one-character unicode string u"𤭢".

Can someone explain to me why this is the case?

lessthanl0l

1 Answer


Python 2 can use UTF-16 as the internal encoding for unicode objects (a so-called "narrow" build), which means 𤭢 (U+24B62) is encoded as two surrogates: D852 DF62. In this case, len returns the number of UTF-16 code units, not the number of actual Unicode code points.
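
To make the surrogate pair concrete, here is a small sketch that reproduces the two code units. It is written for Python 3 so it runs on any current interpreter, and the helper names utf16_code_units and surrogate_pair are made up for this illustration:

```python
def utf16_code_units(text):
    """Return the UTF-16 code units of text as a list of integers."""
    data = text.encode("utf-16-be")  # big-endian, no byte order mark
    return [(data[i] << 8) | data[i + 1] for i in range(0, len(data), 2)]


def surrogate_pair(code_point):
    """Split a code point above U+FFFF into (high, low) surrogates."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# The same four UTF-8 bytes as in the question.
char = b"\xf0\xa4\xad\xa2".decode("utf-8")
units = utf16_code_units(char)

print([format(u, "04X") for u in units])  # ['D852', 'DF62']
print(len(units))                         # 2, what a narrow build reports for len()
```

The count of 2 is exactly what len() reports on a narrow build, because it counts these units rather than code points.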

Python 2 can also be compiled with UTF-32 (UCS-4) as the internal encoding for unicode (a so-called "wide" build). Most unicode objects then take twice as much memory, but len(u'𤭢') == 1.
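
If you are not sure which kind of build you have, sys.maxunicode tells you: it is 0xFFFF on a narrow build and 0x10FFFF on a wide one. A minimal sketch (the function name unicode_build is just for this example):

```python
import sys


def unicode_build(max_code_point=sys.maxunicode):
    """Classify the interpreter as a 'narrow' or 'wide' unicode build."""
    return "narrow" if max_code_point == 0xFFFF else "wide"


print(unicode_build())
```

On Python 3.3 and later this always prints wide, since sys.maxunicode is fixed at 0x10FFFF there.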

Since Python 3.3, str objects switch on demand between Latin-1 (ISO-8859-1), UCS-2 and UCS-4 (UTF-32) storage, none of which uses surrogate pairs, so you'd never encounter this problem: len('𤭢') == 1.
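
You can observe the on-demand switching indirectly through memory use. This is a rough sketch that assumes CPython 3.3 or later; sys.getsizeof reports implementation details, so I compare the sizes rather than rely on exact byte counts:

```python
import sys

# Three strings of the same length whose widest character needs
# 1, 2 and 4 bytes respectively.
ascii_text = "a" * 100
bmp_text = "\u20ac" * 100       # the euro sign, inside the Basic Multilingual Plane
astral_text = "\U00024B62" * 100  # the character from the question

for text in (ascii_text, bmp_text, astral_text):
    print(len(text), sys.getsizeof(text))

print(len("\U00024B62"))  # 1
```

All three strings have length 100, but each one takes more memory than the one before it.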

str in Python 3.0 to 3.2 is the same as unicode in Python 2, so a narrow build of those versions also returns 2.

Karol S
  • How can I loop through a unicode string that contains this kind of character? Something like u"𤭢". – lessthanl0l Jul 19 '14 at 22:22
  • @lessthanl0l: Try something like this: http://stackoverflow.com/questions/7494064/how-to-iterate-over-unicode-characters-in-python-3 – Karol S Jul 21 '14 at 14:21
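
For the looping question in the comments, one possible approach is to walk the string unit by unit and join each high surrogate with the low surrogate that follows it. This is only a sketch, not taken from the linked question: iter_code_points is a made-up name, and it is written for Python 3, where a string built from two surrogate escapes stands in for what a narrow build stores.

```python
def iter_code_points(text):
    """Yield one string per code point, joining UTF-16 surrogate pairs."""
    i = 0
    while i < len(text):
        unit = ord(text[i])
        if 0xD800 <= unit <= 0xDBFF and i + 1 < len(text):
            low = ord(text[i + 1])
            if 0xDC00 <= low <= 0xDFFF:
                # Combine the high and low surrogate into one code point.
                yield chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
                i += 2
                continue
        # Ordinary character, or an unpaired surrogate passed through as-is.
        yield text[i]
        i += 1


# 'a', the two surrogates D852 DF62, then 'b'.
narrow_style = "a\ud852\udf62b"
chars = list(iter_code_points(narrow_style))

print(len(narrow_style))  # 4
print(len(chars))         # 3
```

On an actual Python 2 narrow build you would yield the two-unit slice text[i:i+2] instead of calling chr, because unichr cannot produce code points above 0xFFFF there.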