>>> c='中文'
>>> c
'\xe4\xb8\xad\xe6\x96\x87'
>>> len(c)
6
>>> cu=u'中文'
>>> cu
u'\u4e2d\u6587'
>>> len(cu)
2
>>> s='𤭢'
>>> s
'\xf0\xa4\xad\xa2'
>>> len(s)
4
>>> su=u'𤭢'
>>> su
u'\U00024b62'
>>> len(su)
2
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.stdout.encoding
'UTF-8'
First, I want to make some concepts clear for myself.
I've learned that a unicode string like cu=u'中文' is actually encoded in UTF-16 by the Python shell by default. Is that right? So when we see '\u*', is that actually the UTF-16 encoding? And is '\u4e2d\u6587' a unicode string or a byte string? cu has to be stored in memory somehow, so if it were a byte string, would
0100 1110 0010 1101 0110 0101 1000 0111
(\u4e2d\u6587 converted to binary) be the form in which cu is stored? Am I right?
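As a sanity check on the binary conversion above, here is a minimal sketch that prints each code point of the string as 16 bits. The helper name `code_point_bits` is my own, not part of any library, and the string is written with escapes so the snippet does not depend on the source file encoding. It only shows the numeric values of the code points; it does not prove how the interpreter lays the string out in memory.

```python
cu = u'\u4e2d\u6587'  # the same two characters as in the session above

def code_point_bits(text):
    """Return each element of `text` as a zero-padded 16-bit binary string."""
    return [format(ord(ch), '016b') for ch in text]

bits = code_point_bits(cu)
print(bits)  # ['0100111000101101', '0110010110000111']
```

The two values match the bit pattern I wrote out by hand, grouped four bits at a time.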
But it can't be a byte string. Otherwise len(cu) couldn't be 2; it would be 4! So it has to be a unicode string. BUT I've also learned that
Python attempts to implicitly encode the Unicode string with whatever scheme is currently set in sys.stdout.encoding, which in this instance is "UTF-8".
>>> cu.encode('utf-8')
'\xe4\xb8\xad\xe6\x96\x87'
So how can len(cu) == 2? Is it because there are two '\u' escapes in it?
But then len(su) == 2 doesn't make sense, because su contains only a single '\U' escape!
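One experiment that might narrow this down is to count, for each string, the number of UTF-8 bytes and the number of 16-bit UTF-16 code units, and compare those with the lengths in the session above. This is only a sketch: `unit_counts` is a hypothetical helper of my own, and the idea that my interpreter is a "narrow" build (where `sys.maxunicode` is 65535) is an assumption I have not confirmed.

```python
import sys

def unit_counts(text):
    """Return (UTF-8 byte count, UTF-16 code unit count) for a unicode string."""
    utf8_bytes = len(text.encode('utf-8'))
    # Each UTF-16 code unit is 2 bytes; 'utf-16-le' adds no byte order mark.
    utf16_units = len(text.encode('utf-16-le')) // 2
    return utf8_bytes, utf16_units

print(unit_counts(u'\u4e2d\u6587'))  # (6, 2)
print(unit_counts(u'\U00024b62'))    # (4, 2)

# 65535 would indicate a narrow build; 1114111 a wide build or Python 3.3+.
print(sys.maxunicode)
```

The byte counts match len(c) and len(s), and the code unit counts match len(cu) and len(su), which suggests that len() on my unicode strings may be counting 16-bit units rather than '\u' escapes.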
Am I missing something?
I'm using Python 2.7.12.