
I was just learning about encoding strings in Python, and after experimenting with it a little I got confused: the encoded size of an empty string ('') is 0 bytes in UTF-8 and ASCII, but somehow 2 bytes in UTF-16. How come?

print(len(''.encode('utf16'))) # is 2
print(len(''.encode('utf8'))) # is 0

I guess a big part of the problem is that I don't understand how UTF-16 works. I don't understand why encoding 'spam' in UTF-16 gives 10 bytes instead of just 8 (2 bytes, i.e. 16 bits, for each character). I'm assuming the 2 extra bytes are added by default to any UTF-16 string, for padding or something?

Edit:

I am NOT confused about the basics of how UTF-8 and UTF-16 work or how they differ in storing individual characters. I am confused about why the absence of any characters (an empty string) takes 2 bytes in UTF-16 but 0 bytes in UTF-8 (as opposed to 1 or 0 bytes for both).

The linked question does not answer my question.

1 Answer


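By default, Python includes a byte order mark (BOM) when encoding to UTF-16, but not when encoding to UTF-8. The UTF-16 BOM is 2 bytes long, which is why the empty string encodes to 2 bytes and 'spam' to 10 bytes rather than 8.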

>>> ''.encode('utf16')
b'\xff\xfe'
>>> ''.encode('utf8')
b''
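
To see where the two extra bytes come from, here is a small sketch that compares the default UTF-16 output with an explicit-byte-order encoding. It assumes Python 3 and uses only the standard codecs module; the variable names are just for illustration.

```python
import codecs

text = 'spam'
with_bom = text.encode('utf-16')         # native byte order, BOM prepended
without_bom = text.encode('utf-16-le')   # explicit byte order, no BOM

# The difference in length is exactly the size of the BOM.
bom_len = len(with_bom) - len(without_bom)

# The first two bytes are one of the two UTF-16 BOM constants,
# depending on the byte order of the machine running the code.
starts_with_bom = with_bom[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

print(len(with_bom), len(without_bom), bom_len, starts_with_bom)
# prints: 10 8 2 True
```

So it is not padding: the 8 bytes of character data are there either way, and the BOM adds 2 more in front.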

You can suppress the BOM by explicitly specifying the byte order with a BE (Big-Endian) or LE (Little-Endian) suffix.

>>> ''.encode('utf-16-le')
b''
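
As a rough illustration of what the suffixes change (again assuming Python 3, with made-up variable names), the sketch below shows the two byte orders side by side, and that decoding with plain 'utf-16' consumes the BOM again. The last line uses the separate 'utf-8-sig' codec, which is the opt-in way to get a BOM with UTF-8.

```python
le = 'spam'.encode('utf-16-le')   # low byte first, no BOM
be = 'spam'.encode('utf-16-be')   # high byte first, no BOM
print(le)   # b's\x00p\x00a\x00m\x00'
print(be)   # b'\x00s\x00p\x00a\x00m'

# Decoding with plain 'utf-16' reads the BOM to pick the byte order
# and does not include it in the resulting string.
round_trip = 'spam'.encode('utf-16').decode('utf-16')
print(round_trip)   # spam

# UTF-8 only gets a BOM if you ask for it via 'utf-8-sig'.
utf8_with_bom = ''.encode('utf-8-sig')
print(utf8_with_bom)   # b'\xef\xbb\xbf'
```

With an explicit byte order there is nothing left to signal, so the empty string encodes to zero bytes, just as it does in UTF-8.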
dan04