print(bytes('ba', 'utf-16'))

Result :

b'\xff\xfeb\x00a\x00'

I understand that UTF-16 means every character takes 16 bits, i.e. 00000000 00000000 in binary. I can see 16 bits in \x00a: \x00 = 00000000 and a = 01100001, which together give \x00a. That part is clear to me, but here is the confusion:

\xff\xfeb

1 - What is this?

2 - Why fe? I expected x00.

I have read a lot of Wikipedia articles, but it is still not clear.

S.B

3 Answers

1

I think you are misinterpreting the printout.

You have 3 16-bit words:

  • FFFE: This is the byte-order mark (U+FEFF) that Python's utf-16 codec writes at the start of its output; here it is stored little-endian, as the bytes FF FE (Byte order mark - Wikipedia).
  • The 8-bit encoding of 'b' (shown as the character 'b' instead of an \x escape sequence), followed by 00: This is the 16-bit representation of 'b' (0x0062), stored low byte first.
  • The 8-bit encoding of 'a', followed by 00: This is the 16-bit representation of 'a' (0x0061), stored low byte first.
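
To check this split by hand, one option is to unpack the six bytes as three 16-bit integers. This is only a sketch: the variable names are illustrative, and the byte string is copied from the question, so it assumes the little-endian output shown there.

```python
import struct

# Bytes copied from the question's output (UTF-16 with a leading BOM).
data = b'\xff\xfeb\x00a\x00'

# '<3H' means: three unsigned 16-bit integers, little-endian.
words = struct.unpack('<3H', data)

for word in words:
    print(f'{word:04X}')
# FEFF
# 0062
# 0061
```

The first word is the byte-order mark; the other two are the code points of 'b' and 'a'.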
Darryl Noakes
Fulvio Corno
  • Why 3? There are only two characters, a and b. Where does the third come from? –  Nov 01 '22 at 16:04
  • It's simply as he says. There is an extra word inserted at the start: the byte-order mark required by UTF-16. It's part of the UTF-16 encoding, not the source string itself. – Darryl Noakes Nov 01 '22 at 16:14
  • Yes, the 1st 16-bit word is included in the UTF-16 encoding rule. For the lengthy details see https://docs.python.org/3/library/codecs.html#standard-encodings (look for discussions around BOM). – Fulvio Corno Nov 01 '22 at 16:20
1

You have,

b'\xff\xfeb\x00a\x00'

This is what you asked for; it contains three 16-bit code units.

b'\xff\xfe' # 0xff 0xfe
b'b\x00'    # 0x62 0x00
b'a\x00'    # 0x61 0x00

The first is U+FEFF (byte order mark), the second is U+0062 (b), and the third is U+0061 (a). The byte order mark is there to distinguish between little-endian UTF-16 and big-endian UTF-16. It is normal to find a BOM at the beginning of a UTF-16 document.

It is just confusing to read because the 'b' and 'a' look like they're hexadecimal digits, but they're not.
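
One way to sidestep that ambiguity is to print every byte as two hex digits, so that printable ASCII bytes are no longer shown as letters. A minimal sketch, using the byte string from the question (the variable names are illustrative):

```python
# Bytes copied from the question's output.
data = b'\xff\xfeb\x00a\x00'

# Format each byte as two lowercase hex digits, separated by spaces.
hex_view = ' '.join(f'{byte:02x}' for byte in data)

print(hex_view)  # ff fe 62 00 61 00
```

Here 62 and 61 are the ASCII codes of 'b' and 'a', each followed by a zero byte.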

If you don't want the BOM, you can use utf-16le or utf-16be.

>>> bytes('ba', 'utf-16le')
b'b\x00a\x00'
>>> bytes('ba', 'utf-16be')
b'\x00b\x00a'

The catch is that you get garbage if you decode with the wrong byte order. If you use UTF-16 with a BOM, the decoder can detect the byte order, so you're more likely to get the right result when decoding.
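
A related detail worth checking: the plain utf-16 decoder consumes the BOM, while the explicit utf-16-le decoder treats it as an ordinary character. The sketch below uses the byte string from the question; the variable names are illustrative.

```python
# Bytes copied from the question's output (BOM + 'b' + 'a', little-endian).
data = b'\xff\xfeb\x00a\x00'

# 'utf-16' reads the BOM to pick the byte order, then drops it.
auto_decoded = data.decode('utf-16')

# 'utf-16-le' assumes the byte order and keeps the BOM as U+FEFF.
explicit_decoded = data.decode('utf-16-le')

print(repr(auto_decoded))      # 'ba'
print(repr(explicit_decoded))  # '\ufeffba'
print(len(explicit_decoded))   # 3
```

So if the data may start with a BOM, decoding with utf-16 avoids a stray U+FEFF at the start of the string.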

Dietrich Epp
0

You already got your answer; I just wanted to explain it in my own words for future readers.

In UTF-16, 'a' occupies 16 bits (2 bytes), but its value fits in 8 bits. The question is: should I put the remaining zeroes before the value of 'a' or after it? There are two possibilities:

First: 01100001|00000000
Second: 00000000|01100001

If I don't tell you anything and just hand you these, this would happen:

First = b"0110000100000000"
print(hex(int(First, 2)))   # 0x6100
print(chr(int(First, 2)))   # 愀

Second = b"0000000001100001"
print(hex(int(Second, 2)))  # 0x61
print(chr(int(Second, 2)))  # a

So you can't say anything just by looking at these bytes. Did I mean to send you 愀 or a?
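
The same ambiguity can be seen at the byte level with int.from_bytes, which takes the byte order as an explicit argument. A small sketch (the variable names are illustrative):

```python
# The same two bytes, interpreted with each byte order.
raw = b'a\x00'

as_little = int.from_bytes(raw, 'little')  # first byte is least significant
as_big = int.from_bytes(raw, 'big')        # first byte is most significant

print(hex(as_little))  # 0x61
print(hex(as_big))     # 0x6100
```

Only the little-endian reading gives 0x61, the code point of 'a'.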

First Solution:

I tell you the byte ordering myself, out of band. This is where "big-endian" and "little-endian" come into play:

bytes_ = b"a\x00" # >>>>>> Please decode it with "Little-Endian"!
print(bytes_.decode("utf-16-le"))  # a - Correct.
print(bytes_.decode("utf-16-be"))  # 愀 

So if I tell you the endianness, you can recover the correct character.

Note that we achieved this without any extra character in the data.

Second Solution:

I can "embed" the byte ordering into the bytes themselves without explicitly telling you! This is called a BOM (Byte Order Mark).

ordering1 = b"\xfe\xff"
ordering2 = b"\xff\xfe"

print((ordering1 + b"\x00a").decode("utf-16"))  # a
print((ordering2 + b"a\x00").decode("utf-16"))  # a

Now just passing "utf-16" to .decode() is enough. The decoder works out the byte order from the BOM, so there is no need to specify le or be.
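
To make the detection step concrete, here is a rough sketch of what a BOM check could look like when written by hand. The function sniff_utf16 is a hypothetical helper invented for illustration, not how Python's decoder is implemented; it uses the BOM constants from the standard codecs module and ignores other encodings (for example, a UTF-32 little-endian BOM also starts with FF FE).

```python
import codecs

def sniff_utf16(data):
    """Hypothetical helper: choose a codec from a leading UTF-16 BOM, if any."""
    if data.startswith(codecs.BOM_UTF16_LE):   # b'\xff\xfe'
        return 'utf-16-le', data[2:]
    if data.startswith(codecs.BOM_UTF16_BE):   # b'\xfe\xff'
        return 'utf-16-be', data[2:]
    return None, data  # no BOM: the byte order has to come from elsewhere

for sample in (b'\xff\xfea\x00', b'\xfe\xff\x00a', b'a\x00'):
    codec, payload = sniff_utf16(sample)
    text = payload.decode(codec) if codec else '(byte order unknown)'
    print(codec, text)
```

The first two samples both decode to a; the third has no BOM, so the helper declines to guess.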

S.B