
I have read that the BOM is optional for UTF-8, but mandatory for UTF-16 and UTF-32.

But in which cases is the BOM mandatory for UTF-16 and UTF-32?

What I mean is that there are many cases where I could be dealing with UTF-16 or UTF-32, for example:

  • If I am creating a UTF-16 or UTF-32 text file, should I include the BOM in the file?
  • If I am creating a C++ variable that holds a UTF-16 or UTF-32 string, should I include the BOM in the variable?
  • If I am transmitting a UTF-16 or UTF-32 string over a network, should I transmit the BOM with the string?
user9002947

4 Answers


The byte order mark identifies the byte order of a byte stream when there is no other way to figure it out. So, you use it whenever the byte streams you create might be used in a context where there is no other way to communicate the byte order.

For example, when transmitting a UTF-16 file over HTTP, you can transmit the byte order out-of-band in the charset parameter of the Content-Type HTTP header (e.g. charset=UTF-16LE). But when reading a file from the filesystem, you cannot do that.
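
When no out-of-band label is available, a reader typically falls back to inspecting the first bytes. The following is a minimal sketch of that idea, not a complete detector; `BomKind` and `detect_bom` are hypothetical names introduced only for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical result type: which BOM, if any, starts the buffer.
enum class BomKind { None, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Hypothetical helper: classify the leading bytes of a raw byte buffer.
inline BomKind detect_bom(const std::string& bytes) {
    auto b = [&](std::size_t i) { return static_cast<unsigned char>(bytes[i]); };
    const std::size_t n = bytes.size();
    // Check the 4-byte UTF-32 marks first: FF FE 00 00 begins with the
    // UTF-16LE mark FF FE. (It could also be UTF-16LE text that starts with
    // U+0000; this sketch assumes that case does not occur.)
    if (n >= 4 && b(0) == 0x00 && b(1) == 0x00 && b(2) == 0xFE && b(3) == 0xFF) return BomKind::Utf32BE;
    if (n >= 4 && b(0) == 0xFF && b(1) == 0xFE && b(2) == 0x00 && b(3) == 0x00) return BomKind::Utf32LE;
    if (n >= 3 && b(0) == 0xEF && b(1) == 0xBB && b(2) == 0xBF) return BomKind::Utf8;
    if (n >= 2 && b(0) == 0xFE && b(1) == 0xFF) return BomKind::Utf16BE;
    if (n >= 2 && b(0) == 0xFF && b(1) == 0xFE) return BomKind::Utf16LE;
    return BomKind::None;  // no BOM: the caller needs another source of truth
}
```

A result of `BomKind::None` is exactly the situation described above: the byte order has to come from somewhere else, such as a protocol header.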

Jörg W Mittag

The only time you need a byte order mark is when the program reading the file does not already know what format it’s in. This is often the case with UTF-16, because it comes in two variants, big-endian and little-endian. UTF-16LE is not supposed to be used to save data, but it’s the native format for Windows and therefore a lot of files use it anyway.

I’m not aware of any significant use of UCS-4 to save data files, but it would have the same endianness issue. (In theory, there are even more different byte orders that could apply to it, but the hardware that used them was obsolete long before the encoding was invented.)

UTF-8 has no such variants: there is only one UTF-8 format. However, a few programs, including Microsoft Visual Studio 2008 and 2010, can detect UTF-8 only when it has a byte order mark, and they also lack the /utf-8 option that later versions have, so there is no way to get those compiler versions to understand UTF-8 without a BOM. Newer versions of MSVC autodetect UTF-8 with no special flags if the file has a BOM. Clang, GCC and ICC work fine with the BOM, but Clang in particular does not understand any encoding other than UTF-8. Therefore, for C source files, UTF-8 with a BOM is the lowest common denominator for that collection of compilers.

Other than a few special cases like that, the consensus is that you should save your text files in UTF-8 without a BOM. A lot of other software will not understand a BOM, or will run into trouble if you concatenate multiple files that have BOMs. Furthermore, the different encodings of Unicode are easy to autodetect for real-world documents, and UTF-8 is the default in many contexts.
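
To illustrate the concatenation problem, here is a small sketch that drops a leading UTF-8 BOM from each piece before joining them, so that no stray U+FEFF ends up in the middle of the result. The names `strip_utf8_bom` and `concat_without_boms` are hypothetical, and the sketch assumes each piece is already valid UTF-8 held in a `std::string`.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: remove a leading UTF-8 BOM (EF BB BF) if present.
inline std::string strip_utf8_bom(std::string text) {
    static const std::string bom = "\xEF\xBB\xBF";
    if (text.compare(0, bom.size(), bom) == 0) {
        text.erase(0, bom.size());
    }
    return text;
}

// Hypothetical helper: join several UTF-8 documents without carrying
// their individual BOMs into the middle of the output.
inline std::string concat_without_boms(const std::vector<std::string>& parts) {
    std::string out;
    for (const auto& part : parts) {
        out += strip_utf8_bom(part);
    }
    return out;
}
```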

Davislor
  • Note that a "BOM", when it appears anywhere other than at the beginning of a document, retains its original meaning of a zero-width no-break space. Tools that properly implement Unicode text have no problem with this at all, since when concatenating files you generally don't intend to join words. The problem appears with tools that interpret UTF-8 as some kind of "8-bit ASCII", but they generally are unsuitable to work with Unicode anyway. – MSalters Apr 23 '18 at 10:48
  • @MSalters As you say, “Tools that properly implement Unicode text ....” And that is nearly all recent software. Some older programs don’t, or sometimes I need to run on a non-UTF-8 locale that confuses some tool. – Davislor Apr 23 '18 at 16:16

According to Unicode, the BOM is mandatory if you use the UTF-16 or UTF-32 encoding scheme. It must not be used [Unicode, Table 2-4] if you have the "UTF-16LE" or "UTF-16BE" encoding scheme (and similarly for the 32-bit variants).

So, if you can specify which encoding scheme you are using (e.g. UTF-16BE or UTF-16LE), you may use such a format (with the byte order fully indicated by the label), but without a BOM.

But in general you cannot specify the byte order or the full encoding scheme, e.g. for files, where the filesystem has no property to record it. In that case you label the data as plain UTF-16 and so you must use the BOM. It is then also good to use plain UTF-16 when transmitting (even where you could specify the byte order): it is simpler to have just one version.
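
The difference between the schemes can be sketched in a few lines. This is only an illustration under stated assumptions: `ByteOrder` and `encode_utf16` are hypothetical names, and the input is assumed to be well-formed UTF-16 code units.

```cpp
#include <cstdint>
#include <string>
#include <vector>

enum class ByteOrder { Big, Little };

// Hypothetical helper: serialize UTF-16 code units to bytes.
// with_bom = true  -> plain "UTF-16": U+FEFF is written first, in `order`.
// with_bom = false -> "UTF-16BE"/"UTF-16LE": the label carries the byte
//                     order, so no BOM is written.
inline std::vector<std::uint8_t> encode_utf16(const std::u16string& text,
                                              ByteOrder order, bool with_bom) {
    std::vector<std::uint8_t> out;
    auto put = [&](char16_t unit) {
        const auto hi = static_cast<std::uint8_t>(unit >> 8);
        const auto lo = static_cast<std::uint8_t>(unit & 0xFF);
        if (order == ByteOrder::Big) {
            out.push_back(hi);
            out.push_back(lo);
        } else {
            out.push_back(lo);
            out.push_back(hi);
        }
    };
    if (with_bom) {
        put(char16_t{0xFEFF});  // the byte order mark itself
    }
    for (char16_t unit : text) {
        put(unit);
    }
    return out;
}
```

For example, `encode_utf16(u"A", ByteOrder::Little, true)` produces the bytes FF FE 41 00, while the same call with `with_bom` set to false produces only 41 00 and relies on the reader being told "UTF-16LE".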

Giacomo Catenazzi

If I am creating a UTF-16 or UTF-32 text file, should I include the BOM in the file?

It's probably a good idea to include the BOM if the text file will be used with a variety of applications. If it's just a data file for one particular application that you control, you might not bother.

I would consider using a BOM even with UTF-8 to interoperate well with applications on Windows. Most modern POSIX applications seem to understand how to deal with a BOM in UTF-8.

If I am creating a C++ variable that holds a UTF-16 or UTF-32 string, should I include the BOM in the variable?

Probably not.

If the variable is an array or std::basic_string of wchar_t, char16_t or the like, then the byte order is irrelevant. The CPU will use whatever byte order it finds most natural. You only have to worry about the BOM when importing or exporting that string data.

If the variable is an array or std::vector of bytes (e.g., unsigned char or std::uint8_t) that's holding UTF-encoded text, then you might consider including a BOM or supplementing the array or vector with another piece of metadata that describes the encoding.
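
As a rough sketch of the import side, the function below turns a byte vector into native `char16_t` code units, using a leading BOM to pick the byte order and then dropping it. `decode_utf16` is a hypothetical name; falling back to big-endian when no BOM is present is an assumption of this sketch, and surrogate pairs are not validated.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: import UTF-16 bytes into native code units.
// The BOM, if present, selects the byte order and is not kept in the result.
inline std::u16string decode_utf16(const std::vector<std::uint8_t>& bytes) {
    bool little = false;  // assumed default when there is no BOM
    std::size_t i = 0;
    if (bytes.size() >= 2) {
        if (bytes[0] == 0xFF && bytes[1] == 0xFE) {
            little = true;
            i = 2;
        } else if (bytes[0] == 0xFE && bytes[1] == 0xFF) {
            i = 2;
        }
    }
    std::u16string out;
    for (; i + 1 < bytes.size(); i += 2) {
        const unsigned first = bytes[i];
        const unsigned second = bytes[i + 1];
        const unsigned unit = little ? (second << 8) | first
                                     : (first << 8) | second;
        out.push_back(static_cast<char16_t>(unit));
    }
    return out;  // a trailing odd byte is silently ignored in this sketch
}
```

Once the text is in a `std::u16string`, the byte order no longer matters to the program, which is why the in-memory variable itself does not need a BOM.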

If I am transmitting a UTF-16 or UTF-32 string over a network, should I transmit the BOM with the string?

That will depend on the protocol.

Most modern protocols use UTF-8. Some protocols might use an explicit metadata field to specify which encoding (and byte order) is in use rather than rely on a BOM.

If you're defining a new protocol, I would suggest making it UTF-8 without a BOM.

Adrian McCarthy