
I don't quite understand the principles behind UTF encodings and BOM.

What is the point of having a BOM in UTF-16 and UTF-32 if computers already know how to compose multibyte data types (for example, 4-byte integers) into one variable? Why do we need to specify it explicitly for these encodings, then?

And why don't we need to specify it for UTF-8? The Unicode standard says that it's "byte oriented", but even then we need to know whether a given byte is the first byte of an encoded code point or not. Or is that specified in the first / last bits of every character?

FrozenHeart
  • Theoretically the file could have been created on a big-endian machine. The bytes will be in the wrong order; without the BOM a program has no way to guess that and is sure to get it wrong. UTF-8 does not have an endianness dependency. It is mostly theoretical today, there are not a lot of big-endian machines left. Most practically, a program can use the BOM to auto-detect the encoding. – Hans Passant Jan 20 '16 at 18:09

3 Answers


UTF-16 code units are two bytes wide; let's call those bytes B0|B1. Say we have the letter 'a', which is logically the number 0x0061. Unfortunately, different computer architectures store this number in memory in different ways: on the x86 platform the less significant byte is stored first (at the lower memory address), so 'a' will be stored as 61|00. On PowerPC it will be stored as 00|61. These two architectures are called little endian and big endian for that reason.
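
A minimal Python sketch of those two byte orders (Python is used only as an illustration language here; `bytes.hex(' ')` assumes Python 3.8+):

```
import struct

# 'a' is U+0061; as a 16-bit UTF-16 code unit it is the number 0x0061.
code_unit = 0x0061

print(struct.pack('<H', code_unit).hex(' '))  # 61 00  (little endian, x86-style)
print(struct.pack('>H', code_unit).hex(' '))  # 00 61  (big endian, PowerPC-style)
```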

To speed up string processing, libraries generally store two-byte characters in native order (big endian or little endian); swapping bytes on every access would be too expensive.

Now imagine that someone on PowerPC writes a string to a file; the library will write the bytes 00|61. Someone on x86 then wants to read these bytes, but do they mean 00|61 or maybe 61|00? We can put a special sequence at the beginning of the string so anyone will know the byte order used to save it and can process it correctly. (Converting a string between endiannesses is a costly operation, but most of the time an x86 string will be read on an x86 machine and a PowerPC string on PowerPC machines.)
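
A minimal sketch of what a reader can do with such a marker, in Python; the helper name `decode_utf16_with_bom` is made up for illustration:

```
# Detect the writer's byte order from the BOM, then decode accordingly.
def decode_utf16_with_bom(data: bytes) -> str:
    if data.startswith(b'\xff\xfe'):
        return data[2:].decode('utf-16-le')   # written on a little-endian machine
    if data.startswith(b'\xfe\xff'):
        return data[2:].decode('utf-16-be')   # written on a big-endian machine
    return data.decode('utf-16-le')           # no BOM: we can only guess

print(decode_utf16_with_bom(b'\xfe\xff\x00\x61'))  # 'a' saved on PowerPC (big endian)
print(decode_utf16_with_bom(b'\xff\xfe\x61\x00'))  # 'a' saved on x86 (little endian)
```

(Python's built-in 'utf-16' codec performs the same BOM-based detection for you.)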

With UTF-8 it is a different story: UTF-8 uses a single byte order and encodes the length of a character into the bit pattern of its first byte. The UTF-8 encoding is well described on Wikipedia. Generally speaking, it was designed to avoid the endianness problem altogether.
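
For example, in a quick Python sketch, the high bits of the first byte encode the sequence length and every continuation byte starts with 10:

```
# 0xxxxxxx = 1 byte, 110xxxxx = 2 bytes, 1110xxxx = 3 bytes, 11110xxx = 4 bytes;
# continuation bytes always look like 10xxxxxx.
for ch in ('a', 'é', '€', '😀'):
    print(ch, [f'{b:08b}' for b in ch.encode('utf-8')])

# a ['01100001']
# é ['11000011', '10101001']
# € ['11100010', '10000010', '10101100']
# 😀 ['11110000', '10011111', '10011000', '10000000']
```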

csharpfolk
  • Code points can be represented in UTF-8 via up to 4 bytes. How will the other computer know how it should be composed -- from left to right or from right to left? – FrozenHeart Jan 20 '16 at 18:27
  • @FrozenHeart actually up to 6 bytes – csharpfolk Jan 20 '16 at 18:35
  • "actually up to 6 bytes" -- Are you sure? – FrozenHeart Jan 20 '16 at 18:40
  • @FrozenHeart From Wikipedia: In November 2003, UTF-8 was restricted by RFC 3629 to end at U+10FFFF, in order to match the constraints of the UTF-16 character encoding. This removed all 5- and 6-byte sequences, and 983040 4-byte sequences. Originally it was up to 6. – csharpfolk Jan 20 '16 at 18:43
  • According to https://www.w3.org/International/questions/qa-byte-order-mark the byte 0x00 would be the more significant one, the "big end", and 0x61 the "little end". Therefore 0x00|0x61 is the big-endian and 0x61|0x00 the little-endian encoding of 'a'. – Dietrich Baumgarten Apr 25 '20 at 08:09

Different architectures can encode things differently. One system might write 0x12345678 as 0x12 0x34 0x56 0x78 and another might write it as 0x78 0x56 0x34 0x12. It's important to have a way of understanding how the source system has written things. Bytes are the smallest units read or written, so if a format is written byte-by-byte, there is no problem, just as no system has trouble reading an ASCII file written by another.
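
For instance, sketched in Python (purely as an illustration):

```
n = 0x12345678

print(n.to_bytes(4, 'big').hex(' '))     # 12 34 56 78
print(n.to_bytes(4, 'little').hex(' '))  # 78 56 34 12
```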

The UTF-16 BOM, U+FEFF will either be written as 0xFE 0xFF or 0xFF 0xFE, depending on the system. Knowing in which order those bytes are written tells the reader which order the bytes will be in for the rest of the file. UTF-32 uses the same BOM character, padded with 16 zero bits, but its use is the same.
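
Concretely, a small Python sketch of those BOM byte sequences:

```
bom = '\ufeff'

print(bom.encode('utf-16-be').hex(' '))  # fe ff
print(bom.encode('utf-16-le').hex(' '))  # ff fe
print(bom.encode('utf-32-be').hex(' '))  # 00 00 fe ff
print(bom.encode('utf-32-le').hex(' '))  # ff fe 00 00
```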

UTF-8, on the other hand, is designed to be read a byte at a time. Therefore, the order is the same on all systems, even when dealing with multi-byte characters.
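
A quick Python sketch of that single serialized form (note there is no 'utf-8-le' / 'utf-8-be' codec variant to choose from):

```
text = 'naïve'

encoded = text.encode('utf-8')   # UTF-8 has only one byte order
print(encoded.hex(' '))          # 6e 61 c3 af 76 65 -- the same bytes on every machine
print(encoded.decode('utf-8'))   # naïve
```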

John Sensebe
  • Sorry, but I still don't understand why we have to specify the byte order for UTF-16 and UTF-32 encoded text and why we don't have to do it in the case of UTF-8. What's the difference? – FrozenHeart Jan 20 '16 at 18:17
  • Code points can be represented in UTF-8 via up to 4 bytes. How will the other computer know how it should be composed -- from left to right or from right to left? – FrozenHeart Jan 20 '16 at 18:28
  • Those bytes will be in the same order, as per the spec. Multi-byte UTF-8 characters always start with a "leading byte" in which the top two bits are set. – John Sensebe Jan 20 '16 at 18:30

The UTF-16 and UTF-32 encodings do not specify a byte order. In a stream of 8-bit bytes, the code point U+FEFF can be encoded in UTF-16 as the bytes FE, FF (big endian) or as FF, FE (little endian). The stream writer obviously cannot know where the stream will end up (a file, a network socket, a local program?) so you put a BOM at the beginning to help the reader(s) determine the encoding and byte-order variant.
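
A small Python sketch of that round trip, assuming Python's 'utf-16' codec (which writes a BOM in native byte order and uses the BOM when reading back):

```
data = 'hello'.encode('utf-16')   # BOM followed by code units in this machine's byte order
print(data[:2].hex(' '))          # 'ff fe' on a little-endian machine, 'fe ff' on a big-endian one
print(data.decode('utf-16'))      # the decoder reads the BOM first -> 'hello'
```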

UTF-8 does not have this ambiguity because it is a byte-oriented encoding right from the start. The only way to encode this code point in UTF-8 is with the bytes EF, BB, BF, in this precise order. (Conveniently, the high bits in the first byte of the serialization also reveal how many bytes the sequence will occupy.)
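
Sketched in Python for illustration:

```
bom_utf8 = '\ufeff'.encode('utf-8')
print(bom_utf8.hex(' '))      # ef bb bf -- the only possible UTF-8 serialization
print(f'{bom_utf8[0]:08b}')   # 11101111 -- the leading 1110 says "3-byte sequence"
```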

tripleee