
There are different encodings of the same (standardized) Unicode table. For example, in UTF-8 A corresponds to 0x0041, but in UTF-16 the same A is represented as 0xfeff0041.

From this brilliant article I have learned that when I program in C++ for the Windows platform and deal with Unicode, I should know that it is represented in 2 bytes. But the article does not say anything about the encoding. (It even says that x86 CPUs are little-endian, so I know how those two bytes are stored in memory.) I should also know which Unicode encoding is used, so that I have complete information about how the symbols are stored in memory. Is there a fixed Unicode encoding for C++/Windows programmers?

dda
Narek
  • `A` is **not** represented as `0xfeff0041` in UTF-16. It is `0x41` in UTF-8 and `0x0041` in UTF-16. – Remy Lebeau Nov 21 '12 at 18:42
  • http://www.fileformat.info/info/charset/UTF-16/list.htm here is the source of my info, as I have mentioned already. So how it is stored? – Narek Nov 21 '12 at 18:43
  • Your source is wrong. All of those values should not have `feff` in front of them. `0xFEFF` is used as a UTF-16 BOM. – Remy Lebeau Nov 21 '12 at 18:45
  • @Narek 0xfeff is the *byte order mark*. That table is just telling you what order the following two bytes are in. If you go to [the page for `A`](http://www.fileformat.info/info/unicode/char/0041/index.htm), you'll see the UTF-8 encoding is 0x41 and the UTF-16 encoding is 0x0041. – Joseph Mansfield Nov 21 '12 at 18:46
  • Am I right that all unicode encodings point onto the same symbol by the same code(0x41 = 0x0041 = ...)? – Narek Nov 21 '12 at 18:49
  • @Narek: Not necessarily. If you want UTF-16 character 0x1234 (whatever that is) in UTF-8, the UTF-8 character isn't 0x1234. For characters with a UTF-8 value < 128, yes, I believe it maps the same for UTF-16 and UTF-32. – Cornstalks Nov 21 '12 at 18:53
  • `0x41` is the UTF-8 encoding of `A` and `0x0041` is the UTF-16 encoding of `A`. `A` is a simple example where both UTF-8 and UTF-16 encodings are similar. This is not necessarily the case. – Joseph Mansfield Nov 21 '12 at 18:53
  • UTF-8 encodes Unicode codepoints using 1, 2, 3, or 4 bytes, depending on value. UTF-16 encodes Unicode codepoints using either 2 or 4 bytes, depending on value. Only the ASCII codepoints (0x00-0x7F) have the same value in both UTF-8 and UTF-16 encodings. Codepoints 0x80 and higher are encoded differently otherwise. – Remy Lebeau Nov 21 '12 at 23:58
  • This article says all there is to say about encodings on Windows: http://www.utf8everywhere.org. Also, why not to use widechars. – Pavel Radzivilovsky Nov 22 '12 at 09:37
  • To quibble, Windows does not use any specific encoding for its data types, because Windows, as an operating system, does not have any data types per se. The C Windows **API** uses particular data types for its functions, but that's just an API. Other APIs, such as .NET or the Windows Runtime for JavaScript, may use different representations and encodings, and may not have any particular "default" encoding. – Dan Korn Mar 09 '18 at 23:37

1 Answer

The values stored in memory for Windows are UTF-16 little-endian, always. But that's not what you're talking about; you're looking at file contents. Windows itself does not specify the encoding of files; it leaves that to individual applications.
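As a rough illustration of what "UTF-16 little-endian" means at the byte level, here is a minimal sketch. The helper name `utf16le_bytes` is made up for this example, and it uses the standard `char16_t` type instead of the Windows `wchar_t` so that it does not depend on the platform.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not a Windows API): write each UTF-16 code unit
// as two bytes, low byte first, which is the little-endian layout.
std::vector<std::uint8_t> utf16le_bytes(const std::u16string& text) {
    std::vector<std::uint8_t> bytes;
    bytes.reserve(text.size() * 2);
    for (char16_t unit : text) {
        bytes.push_back(static_cast<std::uint8_t>(unit & 0xFF));         // low byte
        bytes.push_back(static_cast<std::uint8_t>((unit >> 8) & 0xFF));  // high byte
    }
    return bytes;
}
```

For `utf16le_bytes(u"A")` this yields the two bytes 0x41 0x00: the code unit 0x0041 with its low byte stored first and no BOM.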

The 0xfe 0xff you see at the start of the file is a Byte Order Mark, or BOM. It not only indicates that the file is most probably Unicode, but also tells you which Unicode encoding is used.

0xfe 0xff      UTF-16 big-endian
0xff 0xfe      UTF-16 little-endian
0xef 0xbb 0xbf UTF-8

A file that doesn't have a BOM should be assumed to contain 8-bit characters unless you know how it was written. That still doesn't tell you whether it's UTF-8 or some other Windows character encoding; you'll just have to guess.
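The BOM check can be sketched in a few lines of C++. This is only an illustration under some assumptions: `detect_bom` is a hypothetical helper name, it covers only the three BOMs listed in the table above (it does not try to recognize UTF-32), and it makes no attempt to guess the encoding when no BOM is present.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: classify a buffer by its leading bytes.
// Returns "UTF-8", "UTF-16BE", "UTF-16LE", or "unknown" when no BOM matches.
std::string detect_bom(const unsigned char* data, std::size_t size) {
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        return "UTF-8";
    }
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        return "UTF-16BE";
    }
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        return "UTF-16LE";
    }
    return "unknown";  // no BOM: the caller has to know or guess the encoding
}
```

A caller would read the first few bytes of the file into a buffer and pass them in; an "unknown" result corresponds to the case where the encoding has to be chosen some other way.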

You may use Notepad as an example of how this is done. If the file has a BOM, Notepad will read it and process the contents appropriately. Otherwise you must specify the encoding yourself with the "Encoding" dropdown list.

Edit: the reason Windows documentation isn't more specific about the encoding is that Windows was a very early adopter of Unicode, and at the time there was only one encoding, with 16 bits per code point. When 65536 code points were determined to be inadequate, surrogate pairs were invented as a way to extend the range, and UTF-16 was born. Microsoft was already using "Unicode" to refer to its encoding and never changed the name.
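To make the surrogate-pair mechanism concrete, here is a minimal sketch of encoding a single code point as UTF-16. The function name `encode_utf16` is invented for this example, and it assumes the input is a valid code point (at most 0x10FFFF and not itself in the surrogate range); it does no validation.

```cpp
#include <string>

// Hypothetical helper: encode one Unicode code point as UTF-16 code units.
// Assumes cp <= 0x10FFFF and cp is not in the surrogate range 0xD800-0xDFFF.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // Fits in one 16-bit code unit, as in the original 16-bit encoding.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Split the remaining 20 bits across a high and a low surrogate.
        char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

With this sketch, U+0041 stays a single code unit 0x0041, while U+1F600 becomes the pair 0xD83D 0xDE00.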

Mark Ransom
  • **"The values stored in memory for Windows are UTF-16 little-endian, always."** This is what I need! Thanks a lot! Just I wonder is it somewhere documented? – Narek Nov 21 '12 at 19:01
  • @Narek, here's a reference: http://msdn.microsoft.com/en-us/library/windows/desktop/dd374081(v=vs.85).aspx. Quote: "Typically, a Windows application should use UTF-16 internally, converting only as part of a "thin layer" over the interface that must use another format." The fact that it's little-endian isn't specified by Windows but rather the fact that it's a little-endian Intel processor. – Mark Ransom Nov 21 '12 at 19:06
  • @RemyLebeau, the article makes the case that Notepad guesses about as well as can be expected when it doesn't find a BOM. My suggestion was not to guess but let the user decide, and Notepad (at least in Win7) gives you that option too. – Mark Ransom Nov 22 '12 at 03:06
  • Again, to quibble, it's not true that values stored in memory for Windows are UTF-16 little endian always. You can store any value you want, in any encoding, in memory, in a Windows application. It's up to each program accessing that memory how it wants to deal with it. Many of the Windows API functions use UTF-16, but that's just one API. – Dan Korn Mar 09 '18 at 23:44
  • @DanKorn when I say "for Windows" I mean for use in the Windows API, and indeed for most other Microsoft APIs too. – Mark Ransom Mar 09 '18 at 23:49
  • @MarkRansom Can you provide a reference for the fact that Windows APIs always use little endian? Particularly even in the case when the underlying CPU is big endian? – BurntSushi5 May 17 '19 at 21:29
  • @BurntSushi5 can you tell me which big endian CPU is capable of running Windows? – Mark Ransom May 18 '19 at 00:39
  • I don't know. I'm asking from a place of ignorance. :) – BurntSushi5 May 18 '19 at 08:54