
The C-based Win32 API has dual versions of many functions to support both Unicode (UTF-16) strings and older 8-bit codepage strings. The API also defines generic function names and types that abstract the difference away somewhat and allow both versions to be compiled from the same codebase.
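
To make the mechanism concrete, here is a minimal portable sketch of how such a generic name resolves at compile time. In the real headers the switch is the `UNICODE` macro and the generic type and literal macro are `TCHAR` and `TEXT()`; the names `TextLengthA`, `TextLengthW`, `XCHAR`, `XTEXT` and `greeting_length` below are hypothetical stand-ins so the example compiles without `windows.h`.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical function pair in the Win32 naming style:
   "A" takes 8-bit strings, "W" takes wide strings. */
size_t TextLengthA(const char *s)    { assert(s != NULL); return strlen(s); }
size_t TextLengthW(const wchar_t *s) { assert(s != NULL); return wcslen(s); }

/* The generic layer: one macro switch selects the character type,
   the literal prefix, and which function the generic name means. */
#ifdef UNICODE
typedef wchar_t XCHAR;
#define XTEXT(s) L##s
#define TextLength TextLengthW
#else
typedef char XCHAR;
#define XTEXT(s) s
#define TextLength TextLengthA
#endif

/* Code written against the generic names compiles either way. */
size_t greeting_length(void)
{
    const XCHAR *greeting = XTEXT("hello");
    return TextLength(greeting);
}
```

Compiling with `-DUNICODE` routes `greeting_length` through `TextLengthW`; without it, through `TextLengthA`. The source of `greeting_length` itself does not change.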

Microsoft recommends always using the generics (see Conventions for Function Prototypes) so you can compile both versions. But my question is: which versions of Windows are we actually supporting via the 8-bit string API? If it's Windows 95, that's not high on my priorities anymore :). If the generics are only there to support extreme-legacy situations, it would seem easier and clearer to just use the UTF-16 calls directly.

Remy Lebeau
QuadrupleA
  • At this point, generics are there mostly to save you typing an extra character per function. `CreateWindow` may be easier to read and remember than `CreateWindowW`. It's also in all the books and tutorials and examples and such. – Igor Tandetnik Feb 11 '16 at 01:09
  • Curious about the downvotes - I don't know what version introduced the UTF-16 calls, and figured it was a legitimate question of possible use to others since MS doesn't explain it in their documentation. – QuadrupleA Feb 11 '16 at 01:17
  • @IgorTandetnik A more relevant point to me is to say: type the version without the `W`, but don't bother to test compiling against the 8-bit API. – o11c Feb 11 '16 at 02:04

1 Answer


Wikipedia has a detailed article: https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows

I'll start by explaining the three types of string text encoding that Windows supports:

  1. "ANSI" = 8-bit encoding - char in C. Despite the name, this is not necessarily any specific ANSI-standardized encoding, nor ASCII, but whatever the current locale/codepage is.
  2. "MBCS" = Multibyte character set encoding - char in C, where each byte is a char and a single character may span more than one byte. The list of supported codepages is limited and, crucially, does not include UTF-8 (see this QA: Difference between MBCS and UTF-8 on Windows). MBCS is deprecated in modern Windows and should not be used in new projects.
  3. "Unicode" = UTF-16 (or UCS-2 in NT3/NT4) - wchar_t in C, which is 16 bits on Windows.
    • Under UCS-2 only characters represented by codepoints 0x0000-0xFFFF can be used. Characters outside this range cannot be used.
    • In UTF-16 (Windows 2000 and later), codepoints above 0xFFFF are represented by exactly 2 wchar_t values (4 bytes), known as a surrogate pair. This means the number of wchar_t elements in a string (and therefore its length in bytes) is not necessarily the number of codepoints, which in turn is not necessarily the number of printed characters.
    • Ligatures and combining diacritics also break the byte/element/printed-char count equivalence that programming code often assumes.
    • As an example, consider this codepoint string: U+0061, U+0928, U+093F, U+4E9C, U+10083
      • In UTF-16 Big-Endian bytes, this is 00 61 09 28 09 3F 4E 9C D8 00 DC 83
      • which is 12 bytes
      • ...which is 6 wchar_t elements
      • ...but represents 5 codepoints
      • ...which are rendered as 4 printed characters (because the 3rd codepoint is a combining vowel sign that attaches to the 2nd).
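
The counts in this example can be checked with a short portable C sketch. The function names (`utf16_encode`, `utf16_units`, `utf16_codepoints`) are made up for illustration and are not part of any Windows API; `uint16_t` stands in for the 16-bit Windows `wchar_t`, since `wchar_t` is wider on some other platforms.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one Unicode codepoint as UTF-16.
   Writes 1 or 2 code units to out and returns how many were written. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    assert(cp <= 0x10FFFF);              /* must be inside the Unicode range */
    assert(cp < 0xD800 || cp > 0xDFFF);  /* surrogate values are not codepoints */
    if (cp < 0x10000) {
        out[0] = (uint16_t)cp;           /* fits in a single 16-bit unit */
        return 1;
    }
    cp -= 0x10000;                       /* 20 bits remain, split 10/10 */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high (leading) surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low (trailing) surrogate */
    return 2;
}

/* Total number of UTF-16 code units needed for n codepoints. */
size_t utf16_units(const uint32_t *cps, size_t n)
{
    uint16_t tmp[2];
    size_t total = 0;
    for (size_t i = 0; i < n; i++)
        total += utf16_encode(cps[i], tmp);
    return total;
}

/* Number of codepoints in a well-formed UTF-16 buffer:
   every unit counts except the trailing half of a surrogate pair. */
size_t utf16_codepoints(const uint16_t *units, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (units[i] < 0xDC00 || units[i] > 0xDFFF)
            count++;
    return count;
}
```

For the five codepoints above, `utf16_units` gives 6, which at 2 bytes per unit is 12 bytes, and `utf16_codepoints` over the six encoded units gives 5. Counting printed characters is a separate problem that needs grapheme-cluster rules, which this sketch does not attempt.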

Windows 3.x

Windows 3.x implements the 16-bit Windows API, whose string handling is strictly 8-bit. It would use the current default locale and codepage settings if not running the en-US version. It does not support wide characters, but presumably supports some kind of MBCS.

The "Win32 Subset" ("Win32s") add-on for Windows 3.x added some Win32 functions to Windows 3.x, including A and W functions. However, the W functions are stubbed and return failure codes when called, the same behaviour as seen on the "full" Win32 on Windows 95.

Windows 95, 98, Me

Win32, as implemented on Windows 9x (95, 98, Me), does not support UTF-16, only "ANSI" or "MBCS" strings.

An important note is that, as with Win32s, "wide-character" functions do exist on Windows 9x, but they are stubbed out and return failure codes when called. The exception is a handful of resource, command-line and MessageBox-related functions which do handle UCS-2 character strings (e.g. MessageBoxExW).

Supporting non-Latin character sets requires messing around with codepages and the arcane "multibyte" encoding options. Remember that these do not include UTF-8: the only UTF-8 support is in MultiByteToWideChar and WideCharToMultiByte, which can convert between UTF-8 and UTF-16 for use with the Win32 API.

In 2001 Microsoft released an add-on for Windows 9x called the Microsoft Layer for Unicode (MSLU), which changed the W functions from failing-stubs to thunking proxies which converted the strings back into an 8-bit format and then called the A functions, so programs explicitly using W functions could run on Windows 9x.
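
The thunking idea can be sketched in portable C. This is not MSLU's actual code: `CountUpperA` and `CountUpperW` are hypothetical functions invented for the example, and the standard C `wcstombs` stands in for whatever conversion the real layer performs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical 8-bit "A" function: counts uppercase ASCII letters. */
int CountUpperA(const char *s)
{
    assert(s != NULL);
    int count = 0;
    for (; *s != '\0'; s++)
        if (*s >= 'A' && *s <= 'Z')
            count++;
    return count;
}

/* Hypothetical "W" thunk: converts the wide string to the current
   8-bit encoding, then forwards to the "A" implementation.
   Returns -1 if allocation or conversion fails. */
int CountUpperW(const wchar_t *ws)
{
    assert(ws != NULL);
    /* Worst case: every wide character expands to MB_CUR_MAX bytes. */
    size_t cap = wcslen(ws) * MB_CUR_MAX + 1;
    char *narrow = malloc(cap);
    if (narrow == NULL)
        return -1;

    int result = -1;
    /* wcstombs returns (size_t)-1 when a character cannot be
       represented in the current locale's encoding. */
    if (wcstombs(narrow, ws, cap) != (size_t)-1)
        result = CountUpperA(narrow);

    free(narrow);
    return result;
}
```

The failure path is the important design point: a character with no equivalent in the active 8-bit codepage cannot survive the round trip, which is why Unicode filenames and window titles can misbehave under this kind of layer.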

Windows NT

Windows NT, from the very beginning with NT 3.1 (there was no 3.0 version), shipped with W functions that accept wchar_t-typed arguments. Strings were encoded as UCS-2 (NT3, NT4) or UTF-16 (NT5 and later). Microsoft products and documentation generally use "Unicode" as shorthand for UTF-16 or UCS-2. Win32 does not support UTF-8 (excepting MultiByteToWideChar and WideCharToMultiByte), so the shorthand is ambiguous but tolerable.

Windows CE

Windows CE 1.0 supported UTF-16 from the very start; consider its level of support equivalent to the Windows NT family's. I do not know whether CE ever supported UCS-2 instead of UTF-16, or whether it was UTF-16 from the very start.

So in short:

OS                          ANSI | MBCS | UTF-16 ("Unicode")
-----------------------------------------------------------------------------
Windows 3.x (Stock)         Yes  |   ?  | No
Windows 3.x (Win32s)        Yes  |   ?  | Stubbed, always fails (a)
Windows 95, 98, ME          Yes  | Yes  | Limited support (b) otherwise fails
Windows 95, 98, ME (MSLU)   Yes  | Yes  | Yes, runtime thunked to ANSI (c)

Windows NT 3.x, NT 4.x      Yes  | Yes  | As UCS-2 instead of UTF-16 (d)
Windows 2000 and later      Yes  | Yes  | Yes

Windows CE 1.0 and later    Yes  |   ?  | Yes (e)

(a): https://msdn.microsoft.com/en-us/library/cc194796.aspx
(b): https://support.microsoft.com/en-us/kb/210341
(c): https://en.wikipedia.org/wiki/Microsoft_Layer_for_Unicode
(d): https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows
(e): http://www.hpcfactor.com/support/windowsce/wce1.asp

Conclusion: Just use "Unicode". Even if you're writing software targeting Windows 9x you'll be fine, as you can use the Unicode Layer (MSLU) and it will still run (though things like Unicode filenames and window titles might be wonky). Your code will also be portable to Windows CE (I was surprised to see that Windows CE supported UTF-16 from the very start).

Dai
  • Two minor points. 1) There do exist API functions that handle UTF-8 - two of them: `MultiByteToWideChar` and `WideCharToMultiByte`. 2) The correspondence between "Unicode codepoint" and "printed character" is quite loose - even in the absence of surrogate pairs, there are things like combining diacritics and ligatures. The metric you call "character length" - apparently, the number of Unicode codepoints - is not very useful in practice. The number of 16-bit units in UTF-16 representation, on the other hand, is useful, for memory allocation if nothing else. – Igor Tandetnik Feb 11 '16 at 02:40
  • @IgorTandetnik Thank you for the feedback, I have elaborated my answer. – Dai Feb 11 '16 at 02:53
  • Windows CE was released in 1996, same year with UTF-16 – phuclv Feb 11 '16 at 03:06
  • Not all `W` functions fail on Win9x/ME. There are a handful of `W` functions that are natively implemented and do work, such as `MessageBoxW()` (see [Unicode support in Windows 95 and Windows 98](https://support.microsoft.com/en-us/kb/210341) for the complete list). And don't forget, there is also the [Microsoft Layer for Unicode](https://msdn.microsoft.com/en-us/goglobal/bb688166.aspx) library, which adds support for many of the commonly used `W` functions to Win9x/ME. – Remy Lebeau Feb 12 '16 at 05:09
  • @RemyLebeau Thank you for that info, I've revised my answer. – Dai Feb 12 '16 at 05:50