>>> c='中文'
>>> c
'\xe4\xb8\xad\xe6\x96\x87'
>>> len(c)
6
>>> cu=u'中文'
>>> cu
u'\u4e2d\u6587'
>>> len(cu)
2
>>> s='𤭢'
>>> s
'\xf0\xa4\xad\xa2'
>>> len(s)
4
>>> su=u'𤭢'
>>> su
u'\U00024b62'
>>> len(su)
2
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'UTF-8'

First, I want to get some concepts clear for myself. I've read that a unicode string like cu = u'中文' is actually encoded in UTF-16 by the Python shell by default. Right? So when we see '\u*', is that actually a UTF-16 encoding? And is '\u4e2d\u6587' a unicode string or a byte string? But cu has to be stored in memory somehow, so

0100 1110 0010 1101 0110 0101 1000 0111

(\u4e2d\u6587 converted to binary) is the form in which cu is preserved, if it is a byte string? Am I right?

But it can't be a byte string. Otherwise len(cu) couldn't be 2; it would be 4!! So it has to be a unicode string. BUT!!! I've also learned that

python attempts to implicitly encode the Unicode string with whatever scheme is currently set in sys.stdout.encoding, in this instance it's "UTF-8".

>>> cu.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87' 

So! How could len(cu) == 2??? Is that because there are two '\u' escapes in it?

But that doesn't make sense for len(su) == 2!

Am I missing something?

I'm using python 2.7.12

MMMMMCCLXXVII
1 Answer


The Python unicode type holds Unicode codepoints, and is not meant to be an encoding. How Python does this internally is an implementation detail and not something you need to be concerned with most of the time. They are not UTF-16 code units, because UTF-16 is another codec you can use to encode Unicode text, just like UTF-8 is.
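To illustrate (a minimal sketch in Python 3 syntax, where the built-in str type holds codepoints the way Python 2's unicode does): the same two codepoints encode to different byte sequences under different codecs, so neither UTF-8 nor UTF-16 is "the" internal form of the string.

```python
# Python 3 sketch: str holds Unicode codepoints, like Python 2's unicode type.
text = u'\u4e2d\u6587'  # the two codepoints for 中文

# The same codepoints map to different byte sequences per codec:
utf8 = text.encode('utf-8')       # 6 bytes: e4 b8 ad e6 96 87
utf16 = text.encode('utf-16-be')  # 4 bytes: 4e 2d 65 87 (big-endian, no BOM)

print(len(text))  # 2 codepoints, regardless of codec
print(utf8)
print(utf16)
```

Note that the UTF-16-BE bytes happen to spell out the codepoint numbers (4e2d, 6587) only because both characters lie in the Basic Multilingual Plane; that coincidence is not what the unicode type stores.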

The most important thing here is that a standard Python str object holds bytes, which may or may not hold text encoded to a certain codec (your sample uses UTF-8 but that's not a given), and unicode holds Unicode codepoints. In an interactive interpreter session, it is the codec of your terminal that determines what bytes are received by Python (which then uses sys.stdin.encoding to decode these as needed when you create a u'...' unicode object).
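A sketch of that bytes-vs-text boundary (again in Python 3 syntax, where bytes plays the role of Python 2's str):

```python
# bytes here plays the role of Python 2's str (raw bytes),
# while str plays the role of Python 2's unicode (codepoints).
raw = b'\xe4\xb8\xad\xe6\x96\x87'  # UTF-8 bytes, e.g. as sent by a terminal

text = raw.decode('utf-8')  # decode bytes -> codepoints
assert len(raw) == 6        # six bytes
assert len(text) == 2       # two codepoints: U+4E2D, U+6587

# Decoding the same bytes with the wrong codec silently yields mojibake:
mojibake = raw.decode('latin-1')
assert len(mojibake) == 6   # six codepoints now, one per byte
```

This is why the codec matters: the bytes themselves carry no record of how they were encoded.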

Only when writing to sys.stdout (say, when using print) does the sys.stdout.encoding value come into play, where Python will automatically encode your Unicode strings again. Only then will your 2 Unicode codepoints be encoded to UTF-8 again and written to your terminal, which knows how to interpret those bytes.
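You can make that normally-hidden encode step visible with an in-memory stand-in for the terminal (a Python 3 sketch; the BytesIO buffer is a hypothetical substitute for the byte stream underneath sys.stdout):

```python
import io

# A stand-in for the terminal: an in-memory byte stream wrapped the
# same way sys.stdout wraps its underlying byte buffer.
byte_stream = io.BytesIO()
fake_stdout = io.TextIOWrapper(byte_stream, encoding='utf-8')

print(u'\u4e2d\u6587', file=fake_stdout)  # the wrapper encodes on write
fake_stdout.flush()

# The "terminal" only ever sees bytes, never codepoints:
assert byte_stream.getvalue() == b'\xe4\xb8\xad\xe6\x96\x87\n'
```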

You probably want to read up on Python and Unicode.

Martijn Pieters
  • Another question has come up. As you mentioned, only when using `print` will Python automatically encode the unicode string again. So the shell will decode those bytes back to Unicode when it receives them, in order to display them on screen, and use that Unicode to look up the rendered glyphs (I'm not sure of the exact term). So can I simply think that if I want to display something on screen, Unicode is the last form? Do you get my point? – MMMMMCCLXXVII Oct 03 '16 at 16:30