len() with unicode strings

Question

If I do:

print "\xE2\x82\xAC"
print len("€")
print len(u"€")

I get:

€
3
1

But if I do:

print '\xf0\xa4\xad\xa2'
print len("𤭢")
print len(u"𤭢")

I get:

𤭢
4
2

In the second example, len() returned 2 instead of 1 for the one-character unicode string u"𤭢".

Can someone explain to me why this is the case?

Karol S · Accepted Answer · 2014-07-19T23:39:34.483

2

Python 2 can use UTF-16 as the internal encoding for unicode objects (a so-called "narrow" build), which means 𤭢 (U+24B62) is encoded as two surrogates: D852 DF62. In this case, len returns the number of UTF-16 code units, not the number of actual Unicode code points.
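To see where D852 DF62 comes from, here is a minimal sketch of the standard UTF-16 surrogate arithmetic. The function name to_surrogate_pair is made up for illustration and is not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 high and low surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF need surrogates")
    offset = code_point - 0x10000        # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits go into the low surrogate
    return high, low


high, low = to_surrogate_pair(0x24B62)
print("%04X %04X" % (high, low))  # D852 DF62
```

A narrow build stores those two values as separate items, which is why len reports 2.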

Python 2 can also be compiled with UTF-32 enabled for unicode (a so-called "wide" build), which means most unicode objects take twice as much memory, but then len(u'𤭢') == 1.
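If you are not sure which kind of build you are running, sys.maxunicode tells you. A small sketch (the variable name is_narrow_build is just for illustration):

```python
import sys

# sys.maxunicode is 0xFFFF on a narrow (UTF-16) build and 0x10FFFF on a
# wide build; on Python 3.3 and later it is always 0x10FFFF.
is_narrow_build = sys.maxunicode == 0xFFFF
print(is_narrow_build)
```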

Since 3.3, Python 3's str objects switch on demand between Latin-1 (ISO-8859-1), UCS-2 and UCS-4 (UTF-32) storage, so you'd never encounter this problem: len('𤭢') == 1.

str in Python 3.0 to 3.2 is the same as unicode in Python 2, so a narrow build of those versions still gives len('𤭢') == 2.
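For comparison, a short sketch that counts the same character in three different units. It assumes Python 3.3 or later, and the variable names are only illustrative.

```python
s = "\U00024B62"  # the character from the question, written as an escape

code_points = len(s)                           # what len() counts on Python 3.3+
utf8_bytes = len(s.encode("utf-8"))            # the 4 bytes from the question
utf16_units = len(s.encode("utf-16-le")) // 2  # what a narrow build's len() counts

print(code_points, utf8_bytes, utf16_units)  # 1 4 2
```

The three numbers line up with the 4 and 2 seen in the question and the 1 that was expected.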

edited Jul 19 '14 at 23:39

answered Jul 19 '14 at 17:44

Karol S


How can I loop through a unicode string that contains this kind of character? Something like u"𤭢". – lessthanl0l Jul 19 '14 at 22:22
@lessthanl0l: Try something like this: http://stackoverflow.com/questions/7494064/how-to-iterate-over-unicode-characters-in-python-3 – Karol S Jul 21 '14 at 14:21
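One possible way to loop over whole characters on a narrow build is to join each surrogate pair by hand. This is only a sketch and is not taken from the linked question; iter_code_points is a made-up name. It should also run unchanged on a wide build or on Python 3.3+, where no pairs are present and each item is simply one character.

```python
def iter_code_points(text):
    """Yield whole code points, joining any UTF-16 surrogate pair into one item."""
    i = 0
    while i < len(text):
        unit = text[i]
        # A high surrogate followed by a low surrogate forms one code point.
        if (u"\ud800" <= unit <= u"\udbff" and i + 1 < len(text)
                and u"\udc00" <= text[i + 1] <= u"\udfff"):
            yield text[i:i + 2]
            i += 2
        else:
            yield unit
            i += 1


for ch in iter_code_points(u"a\U00024B62b"):
    print(repr(ch))
```

On a narrow build the middle item is a two-unit string that still prints as a single character.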
