Wikipedia has a detailed article: https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows
I'll start by explaining the three types of string text encoding that Windows supports:
- "ANSI" = 8-bit encoding - char in C. Despite the name, "ANSI" is not necessarily any specific ANSI encoding, nor ASCII, but whatever the current locale/codepage is.
- "MBCS" = Multibyte character set encoding - char in C (where each byte is a char). The list of supported codepages is limited and, crucially, does not include UTF-8 (see this QA: Difference between MBCS and UTF-8 on Windows). MBCS is deprecated in modern Windows and should not be used in new projects.
- "Unicode" = UTF-16 (or UCS-2 in NT3/NT4) - wchar_t in C, which is 16 bits wide on Windows.
  - Under UCS-2, only characters with codepoints 0x0000-0xFFFF can be used. Characters outside this range cannot be represented.
  - In UTF-16 (Windows 2000 and later), codepoints above 0xFFFF are represented by two wchar_t values (4 bytes), known as a surrogate pair. This means the length of a string in bytes, or in wchar_t elements (a size_t count), is not necessarily proportional to the number of codepoints, which in turn is not necessarily the number of printed characters.
  - Things like ligatures and diacritics also break the byte/element/printed-char count equivalence assumptions that programming code often makes.
  - As an example, consider this codepoint string: U+0061, U+0928, U+093F, U+4E9C, U+10083
    - In UTF-16 Big-Endian bytes, this is 00 61 09 28 09 3F 4E 9C D8 00 DC 83
    - ...which is 12 bytes
    - ...which is 6 wchar_t elements
    - ...but represents 5 codepoints
    - ...which are rendered as 4 printed characters (because the 2nd and 3rd codepoints combine into a single glyph).
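The byte, element and codepoint counts in the example above can be checked with a small sketch in portable C. The function names (utf16_encode, utf16_encode_all, utf16_count_codepoints) are hypothetical helpers written for this illustration, not Win32 APIs, and uint16_t stands in for the 16-bit Windows wchar_t because wchar_t is wider on many other platforms.

```c
#include <stddef.h>
#include <stdint.h>

/* Encode one codepoint as UTF-16 code units.
   Returns the number of units written (1 or 2), or 0 if cp is invalid. */
size_t utf16_encode(uint32_t cp, uint16_t out[2])
{
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return 0;                                /* out of range, or a surrogate value */
    if (cp <= 0xFFFF) {
        out[0] = (uint16_t)cp;                   /* fits in one unit (the UCS-2 range) */
        return 1;
    }
    cp -= 0x10000;                               /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
    out[1] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
    return 2;
}

/* Encode n codepoints into out (capacity cap units).
   Returns the number of units written; stops early on invalid input or a full buffer. */
size_t utf16_encode_all(const uint32_t *cps, size_t n, uint16_t *out, size_t cap)
{
    size_t len = 0;
    for (size_t i = 0; i < n; i++) {
        uint16_t tmp[2];
        size_t k = utf16_encode(cps[i], tmp);
        if (k == 0 || len + k > cap)
            break;
        for (size_t j = 0; j < k; j++)
            out[len++] = tmp[j];
    }
    return len;
}

/* Count codepoints in n UTF-16 code units: a high surrogate followed by a
   low surrogate counts once; every other unit counts on its own. */
size_t utf16_count_codepoints(const uint16_t *s, size_t n)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < n &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF)
            i++;                                 /* skip the low half of the pair */
        count++;
    }
    return count;
}
```

Feeding the five codepoints from the example through utf16_encode_all yields 6 elements (12 bytes), and utf16_count_codepoints on the result returns 5. Counting printed characters would additionally need grapheme-cluster rules, which this sketch does not attempt.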
Windows 3.x
Windows 3.x implements the 16-bit Windows API, whose strings are strictly 8-bit. It would use the current default locale and codepage settings if not running the en-US version. It does not support wide characters, but presumably supports some kind of MBCS.
The "Win32 Subset" ("Win32s") add-on for Windows 3.x added some Win32 functions to Windows 3.x, including A and W functions. However, the W functions are stubbed and return failure codes when called, the same behaviour as seen on the "full" Win32 on Windows 95.
Windows 95, 98, Me
Win32, as implemented on Windows 9x (95, 98, Me), does not support UTF-16, only "ANSI" or "MBCS" strings.
An important note is that, as with Win32s, "wide-character" functions do exist on Windows 9x, but they are stubbed out and return failure codes when called, with the exception of a handful of resource, command-line and MessageBox-related functions which do handle UCS-2 character strings (e.g. MessageBoxExW).
Supporting non-Latin character sets requires messing around with codepages and the arcane "multibyte" encoding options (which, remember, do not support UTF-8, except for MultiByteToWideChar and WideCharToMultiByte, which can be used for converting UTF-8 to UTF-16 for use with the Win32 API).
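To make the UTF-8 to UTF-16 step concrete, here is a minimal portable sketch of what such a conversion does. On Windows itself you would normally call MultiByteToWideChar with the CP_UTF8 codepage rather than write this by hand. The function utf8_to_utf16 is a hypothetical helper for illustration only, and uint16_t stands in for the 16-bit Windows wchar_t.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode a NUL-terminated UTF-8 string into UTF-16.
   dst has room for cap code units, including the terminating 0.
   Returns the number of units written (excluding the terminator), or
   (size_t)-1 if the input is malformed or dst is too small. */
size_t utf8_to_utf16(const char *src, uint16_t *dst, size_t cap)
{
    /* Smallest codepoint allowed for each sequence length (rejects overlong forms). */
    static const uint32_t min_cp[4] = {0, 0x80, 0x800, 0x10000};
    const unsigned char *p = (const unsigned char *)src;
    size_t len = 0;

    if (cap == 0)
        return (size_t)-1;                      /* no room even for the terminator */

    while (*p) {
        uint32_t cp;
        int extra;                              /* number of continuation bytes */
        if (*p < 0x80)                { cp = *p;        extra = 0; }
        else if ((*p & 0xE0) == 0xC0) { cp = *p & 0x1F; extra = 1; }
        else if ((*p & 0xF0) == 0xE0) { cp = *p & 0x0F; extra = 2; }
        else if ((*p & 0xF8) == 0xF0) { cp = *p & 0x07; extra = 3; }
        else return (size_t)-1;                 /* stray continuation or invalid lead byte */
        p++;
        for (int i = 0; i < extra; i++, p++) {
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;              /* truncated sequence (also catches NUL) */
            cp = (cp << 6) | (*p & 0x3F);
        }
        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;                  /* overlong, out of range, or surrogate */

        size_t need = cp > 0xFFFF ? 2 : 1;
        if (len + need + 1 > cap)
            return (size_t)-1;                  /* keep one unit free for the terminator */
        if (cp > 0xFFFF) {
            cp -= 0x10000;
            dst[len++] = (uint16_t)(0xD800 + (cp >> 10));    /* high surrogate */
            dst[len++] = (uint16_t)(0xDC00 + (cp & 0x3FF));  /* low surrogate */
        } else {
            dst[len++] = (uint16_t)cp;
        }
    }
    dst[len] = 0;
    return len;
}
```

The resulting buffer is what a W function expects. The sketch fails hard on malformed input, which is a design choice made for this example and not a description of how the Win32 functions handle errors.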
In 2001 Microsoft released an add-on for Windows 9x called the Microsoft Layer for Unicode (MSLU), which changed the W functions from failing stubs to thunking proxies that converted the strings back into an 8-bit format and then called the A functions, so programs explicitly using W functions could run on Windows 9x.
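The thunking idea can be sketched in portable C as follows. This is not MSLU's actual code: DemoSetTitleA, DemoSetTitleW and DemoGetTitleA are hypothetical functions invented for this illustration, uint16_t stands in for the 16-bit wchar_t, and Latin-1 is assumed as a stand-in for whatever the system codepage would be.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TITLE_MAX 64

/* State owned by the hypothetical 8-bit ("A") function. */
static char g_title[TITLE_MAX];

/* Hypothetical "A" function: stores an 8-bit title. Returns 1 on success, 0 on failure. */
int DemoSetTitleA(const char *title)
{
    if (title == NULL || strlen(title) >= TITLE_MAX)
        return 0;
    strcpy(g_title, title);
    return 1;
}

/* Hypothetical accessor so callers can see what the "A" side received. */
const char *DemoGetTitleA(void)
{
    return g_title;
}

/* Hypothetical "W" thunk: narrows the wide string to 8 bits, then forwards
   to the "A" function. Code units above 0xFF have no Latin-1 equivalent
   (an assumption standing in for the real codepage) and become '?'. */
int DemoSetTitleW(const uint16_t *title)
{
    char narrow[TITLE_MAX];
    size_t i;

    if (title == NULL)
        return 0;
    for (i = 0; title[i] != 0; i++) {
        if (i + 1 >= TITLE_MAX)
            return 0;                            /* no room left for the terminator */
        narrow[i] = title[i] <= 0xFF ? (char)title[i] : '?';
    }
    narrow[i] = '\0';
    return DemoSetTitleA(narrow);
}
```

A caller that passes the wide string H, i, U+4E9C to DemoSetTitleW ends up with the 8-bit title "Hi?". This kind of lossy narrowing is why text outside the active codepage can come out wrong when W calls are thunked to A calls.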
Windows NT
Windows NT, from the very beginning with NT 3.1 (there was no 3.0 version), shipped with W functions that accepted wchar_t-typed arguments. Strings were encoded with UCS-2 (NT3, NT4) or UTF-16 (NT5 and later). Microsoft products and documentation generally use "Unicode" as a shorthand for UTF-16 or UCS-2. Win32 does not support UTF-8 (excepting MultiByteToWideChar and WideCharToMultiByte), so the ambiguity is barely passable.
Windows CE
Windows CE 1.0 supported "Unicode" from the very start; consider its level of support equivalent to the Windows NT family's. I do not know whether early versions of CE used UCS-2 instead of UTF-16, or whether it was UTF-16 all along.
So in short:
OS                         | ANSI | MBCS | UTF-16 ("Unicode")
-----------------------------------------------------------------------------
Windows 3.x (Stock)        | Yes  | ?    | No
Windows 3.x (Win32s)       | Yes  | ?    | Stubbed, always fails (a)
Windows 95, 98, ME         | Yes  | Yes  | Limited support (b), otherwise fails
Windows 95, 98, ME (MSLU)  | Yes  | Yes  | Yes, runtime thunked to ANSI (c)
Windows NT 3.x, NT 4.x     | Yes  | Yes  | As UCS-2 instead of UTF-16 (d)
Windows 2000 and later     | Yes  | Yes  | Yes
Windows CE 1.0 and later   | Yes  | ?    | Yes (e)
(a): https://msdn.microsoft.com/en-us/library/cc194796.aspx
(b): https://support.microsoft.com/en-us/kb/210341
(c): https://en.wikipedia.org/wiki/Microsoft_Layer_for_Unicode
(d): https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows
(e): http://www.hpcfactor.com/support/windowsce/wce1.asp
Conclusion: Just use "Unicode": even if you're writing software targeting Windows 9x you'll be fine, as you can use the Unicode Layer (MSLU) and it will still run (though things like Unicode filenames and window titles might be wonky). Your code will also be portable to Windows CE (I was surprised to see that Windows CE supported UTF-16 from the very start).