You already got your answer; I just wanted to explain it in my own words for future readers.
In UTF-16 encoding, 'a' occupies one 16-bit code unit, that is, 2 bytes. The value of 'a' itself (0x61) fits in 8 bits. The question is: should I put the remaining zeroes before the value of 'a' or after it? There are two possible ways:
First: 01100001|00000000
Second: 00000000|01100001
If I just hand you these without telling you anything else, this is what happens:
First = b"0110000100000000"
print(hex(int(First, 2))) # 0x6100
print(chr(int(First, 2))) # 愀
Second = b"0000000001100001"
print(hex(int(Second, 2))) # 0x61
print(chr(int(Second, 2))) # a
So you can't tell anything just by looking at these bytes. Did I mean to send you 愀 or a?
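To see the two layouts come out of Python itself, here is a small sketch that encodes 'a' with each explicit byte order. The variable names `little` and `big` are just mine for illustration.

```python
# Encode the same character with each explicit byte order.
little = "a".encode("utf-16-le")
big = "a".encode("utf-16-be")

print(little)  # b'a\x00'  -> value first, zero byte after
print(big)     # b'\x00a'  -> zero byte first, value after

# Same two bytes, just in the opposite order.
print(little == big[::-1])  # True

# Reading the same two bytes as a number under each byte order.
print(hex(int.from_bytes(little, "little")))  # 0x61
print(hex(int.from_bytes(little, "big")))     # 0x6100
```

The last two lines are the same ambiguity as above: identical bytes, two different values depending on which order the reader assumes.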
First Solution:
I simply tell you the byte ordering myself, verbally. This is where "big-endian" and "little-endian" come into play:
bytes_ = b"a\x00" # >>>>>> Please decode it with "Little-Endian"!
print(bytes_.decode("utf-16-le")) # a - Correct.
print(bytes_.decode("utf-16-be")) # 愀
So if I tell you the endianness, you can get the correct character.
Notice that we achieved this without adding any extra character to the data.
Second Solution:
I can "embed" the byte ordering into the bytes themselves without explicitly telling you! This is called a BOM (Byte Order Mark).
ordering1 = b"\xfe\xff" # BOM for big-endian
ordering2 = b"\xff\xfe" # BOM for little-endian
print((ordering1 + b"\x00a").decode("utf-16")) # a
print((ordering2 + b"a\x00").decode("utf-16")) # a
Now just passing "utf-16" to .decode() is enough. It can figure out the correct byte order on its own. There is no need to specify le or be; that information is already there.
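As a small addition, here is a sketch of how the BOM shows up on the encoding side. It uses the `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` constants from the standard library; the names `encoded` and `kept` are mine. Which BOM you get from plain "utf-16" depends on your platform's native byte order, so the check below accepts either one.

```python
import codecs

# The two BOMs are available as constants in the standard library.
print(codecs.BOM_UTF16_BE)  # b'\xfe\xff'
print(codecs.BOM_UTF16_LE)  # b'\xff\xfe'

# Encoding with plain "utf-16" prepends a BOM automatically
# (in the platform's native byte order).
encoded = "a".encode("utf-16")
print(encoded[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))  # True
print(encoded.decode("utf-16"))  # a

# With an explicit byte order the BOM is not consumed; it stays
# in the result as the character U+FEFF.
kept = (codecs.BOM_UTF16_LE + b"a\x00").decode("utf-16-le")
print(repr(kept))  # '\ufeffa'
```

So the two solutions are not meant to be combined: either state the byte order (utf-16-le / utf-16-be) and send no BOM, or send the BOM and decode with plain "utf-16".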