The internal representation varies among Latin-1, UCS-2 and UCS-4. UCS means that every code unit is a fixed 2 or 4 bytes long and is numerically equal to the corresponding code point. We can check this by finding where the size of the code unit changes.
To show that the code units range from 1 byte for Latin-1 to 4 bytes for UCS-4:
>>> from sys import getsizeof
>>> getsizeof('')
49
>>> getsizeof('a') # + 1 byte as the representation here is Latin-1
50
>>> getsizeof('\U0010ffff')
80
>>> getsizeof('\U0010ffff\U0010ffff') # + 4 bytes as the representation here is UCS-4
84
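The per-character cost can also be measured directly by subtracting the size of a one-character string from the size of the same character repeated twice, which cancels the fixed object header. The following is a minimal sketch that assumes CPython 3.3 or later; `bytes_per_char` is a hypothetical helper name, and the absolute sizes reported by `getsizeof` differ between versions and builds.

```python
from sys import getsizeof

def bytes_per_char(ch):
    """Storage bytes per code point for a string made only of `ch`.

    Assumes CPython's flexible string representation; other
    implementations may report different numbers.
    """
    # The header size cancels out, leaving the cost of one extra code unit.
    return getsizeof(ch * 2) - getsizeof(ch)

print(bytes_per_char('a'))           # Latin-1 range: 1
print(bytes_per_char('\U0010ffff'))  # UCS-4 range: 4
```

Using a difference rather than an absolute size keeps the check independent of the header size, which is not the same for every string.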
We can check that the representation starts out as Latin-1 and not UTF-8: the change to a 2-byte code unit happens at the byte boundary ('\U000000ff' - '\U00000100'), not at the '\U0000007f' - '\U00000080' boundary as it would in UTF-8:
>>> getsizeof('\U0000007f')
50
>>> getsizeof('\U00000080') # The size of the string changes at the \x7f - \x80 boundary (non-ASCII strings get a larger header), but..
74
>>> getsizeof('\U00000080\U00000080') # ..the size of the code unit is still one byte, so not UTF-8
75
>>> getsizeof('\U000000ff')
74
>>> getsizeof('\U000000ff\U000000ff') # (+ 1 byte)
75
>>> getsizeof('\U00000100')
76
>>> getsizeof('\U00000100\U00000100') # Size changes at the byte boundary (+ 2 bytes). The representation is UCS-2.
78
>>> getsizeof('\U0000ffff')
76
>>> getsizeof('\U0000ffff\U0000ffff') # (+ 2 bytes)
78
>>> getsizeof('\U00010000')
80
>>> getsizeof('\U00010000\U00010000') # (+ 4 bytes) The size of the code unit changes to 4 at a byte boundary again.
84
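Rather than probing hand-picked code points, the boundaries can be found by scanning the whole code-point range and recording where the per-character width changes. This is a rough sketch assuming CPython 3.3 or later; `char_width` and `width_boundaries` are hypothetical names introduced here for illustration.

```python
from sys import getsizeof

def char_width(cp):
    """Bytes used per code point in a string consisting only of chr(cp)."""
    ch = chr(cp)
    return getsizeof(ch * 2) - getsizeof(ch)

def width_boundaries(limit=0x110000):
    """Return (code_point, new_width) wherever the width differs from the previous code point."""
    boundaries = []
    prev = char_width(0)
    for cp in range(1, limit):
        width = char_width(cp)
        if width != prev:
            boundaries.append((cp, width))
            prev = width
    return boundaries

for cp, width in width_boundaries():
    print(f"U+{cp:04X}: {width} byte(s) per code point")
```

On CPython this should report changes only at U+0100 (2 bytes) and U+10000 (4 bytes). Nothing is reported at U+0080, because only the header grows there while the code unit stays at one byte, which is what rules out UTF-8.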