What does "Beta: Use Unicode UTF-8 for worldwide language support" actually do?

Question

In some Windows 10 builds (Insider builds starting in April 2018, and also the regular 1903 release) there is a new option called "Beta: Use Unicode UTF-8 for worldwide language support".

You can see this option by going to Settings and then: All Settings -> Time & Language -> Language -> "Administrative Language Settings"

This is what it looks like:

When this checkbox is checked I observe some irregularities (below) and I would like to know what exactly this checkbox does and why the below happens.

Create a brand new Windows Forms application in Visual Studio 2019. On the main form, specify the Paint event handler as follows:

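private void Form1_Paint(object sender, PaintEventArgs e)
{
    // Draw the two Webdings glyphs selected by the characters '0' and 'r'.
    // The Font is disposed after each paint so GDI handles are not leaked.
    using (Font buttonFont = new Font("Webdings", 9.25f))
    {
        TextRenderer.DrawText(e.Graphics, "0r", buttonFont, new Point(), Color.Black);
    }
}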

Run the program, here is what you will see if the checkbox is NOT checked:

However, if you check the checkbox (and reboot as asked) this changes to:

You can look up the Webdings font on Wikipedia. According to the character table given there, the code points for these two characters are "\U0001F5D5\U0001F5D9". If I use them instead of "0r", it works with the checkbox checked, but with the checkbox unchecked it now looks like this:

I would like to find a solution that always works, regardless of whether the box is checked or unchecked.

Can this be done?

The system locale determines the ANSI and OEM codepages. The checkbox forces them to UTF-8 (codepage 65001). Apparently this has a secondary effect that causes `DrawText` to not render "0r" using the selected font. I'd guess it's because symbol fonts such as Webdings and Wingdings don't claim any Unicode ranges or legacy codepages in the font's OS/2 table. Instead they map codes to arbitrary glyphs. Continuing to guess, maybe `"\U0001F5D5\U0001F5D9"` will work if you select a regular font. Apparently font fallback can find the needed font(s). – Eryk Sun, Jun 03 '19 at 15:47
maybe useful: https://learn.microsoft.com/en-us/dotnet/framework/winforms/advanced/international-fonts-in-windows-forms-and-controls — George Birbilis, Oct 01 '19 at 15:07
Other than the first two sentences, I was just speculating. You're closer to the problem and its solution than I am. I'd prefer for you to answer this yourself if you don't mind. — Eryk Sun, Dec 05 '19 at 11:50
What encoding is your source code in? Changing the Windows setting will change CP_ACP, which will change the interpretation of characters in text files that do not have an explicit encoding (via BOM). Your source appears to contain extended characters. In source code I recommend using Unicode escape codes in string literals for any characters that aren't ASCII; otherwise you are depending on ambient settings which may vary (and are varying in this scenario). – Paul Dempsey, Jan 31 '20 at 21:28
@PaulDempsey all information to reproduce it is given in the question. No special encoding changes were made that were not mentioned in the OP. Additionally, my source does not appear to contain extended characters other than Unicode escape codes in string literals, which are marked as such in the OP. – Andrew Savinykh, Jan 31 '20 at 22:42
You could take a look at https://en.wikipedia.org/wiki/Unicode_in_Microsoft_Windows — Mark, Feb 24 '21 at 14:57

user541686 · Answer 1 · 2019-08-03T20:35:50.710

16

You can see it in ProcMon. It seems to set the REG_SZ values ACP, MACCP, and OEMCP in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage to 65001.
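
To check the same values from code rather than from ProcMon, a minimal C# sketch could read them with `Microsoft.Win32.Registry`. The class and method names below are hypothetical, and the sketch assumes Windows and that the values are stored as strings, as described above.

```csharp
using System;
using Microsoft.Win32;

static class NlsCodePages
{
    // Registry key named in the answer above.
    private const string KeyPath =
        @"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage";

    // Returns the raw string value (e.g. "1252" or "65001"), or null if it is missing.
    public static string Read(string valueName)
    {
        return Registry.GetValue(KeyPath, valueName, null) as string;
    }

    // True when a value read from the key denotes UTF-8 (code page 65001).
    public static bool IsUtf8(string value)
    {
        return value == "65001";
    }

    static void Main()
    {
        foreach (string name in new[] { "ACP", "OEMCP", "MACCP" })
        {
            string value = Read(name);
            Console.WriteLine("{0} = {1} (UTF-8: {2})",
                name, value ?? "(not found)", IsUtf8(value));
        }
    }
}
```

Reading the key only reports the configured setting; a running process keeps the code page it started with until the next reboot.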

I'm not entirely sure, but it might be related to the variable gAnsiCodePage in KernelBase.dll, which GetACP reads. If you really want to, you might be able to change it for your program regardless of the system setting: disassemble GetACP at run time to find the instruction sequence that reads gAnsiCodePage, obtain a pointer to the variable, and update it directly.

(Actually, I see references to an undocumented function named SetCPGlobal that would've done the job, but I can't find that function on my system. Not sure if it still exists.)

edited Aug 03 '19 at 20:35

answered Aug 03 '19 at 20:27


Thank you, indeed the values 'ACP', 'MACCP' and 'OEMCP' change to 65001 if UTF-8 is ticked. How did you find out in the first place that these are the registry values that are modified by the UTF-8 checkbox? – Joakim Thorén Apr 01 '20 at 09:09
@JoakimThorén: You can see it in [ProcMon](https://learn.microsoft.com/en-us/sysinternals/downloads/procmon)? – user541686 Apr 01 '20 at 09:10

Karol Zlot · Answer 2 · 2021-07-14T16:50:43.400

Please look at this question to see what it solves when it is enabled: How to save to file non-ascii output of program in Powershell?

I also found the explanation written by Ghisler helpful (source):

If you check this option, Windows will use codepage 65001 (Unicode UTF-8) instead of the local codepage like 1252 (Western Latin1) for all plain text files. The advantage is that text files created in e.g. Russian locale can also be read in other locale like Western or Central Europe. The downside is that ANSI-Only programs (most older programs) will show garbage instead of accented characters.
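
The "garbage instead of accented characters" effect can be reproduced in a few lines of C#. This is only an illustrative sketch: the class and method names are hypothetical, and ISO-8859-1 stands in for code page 1252 (an assumption made so the sample runs without registering extra encoding providers; the two agree on the bytes used here).

```csharp
using System;
using System.Linq;
using System.Text;

static class CodePageMismatch
{
    // Stand-in for Windows-1252; identical to it for the bytes E9, C3 and A9.
    public static readonly Encoding Legacy = Encoding.GetEncoding("iso-8859-1");

    // U+00E9 (e with acute accent), written as an escape so the source file's
    // own encoding does not matter.
    public const string Text = "\u00E9";

    // UTF-8 bytes as seen by a program that assumes the legacy code page.
    public static string Utf8ReadAsLegacy()
    {
        return Legacy.GetString(Encoding.UTF8.GetBytes(Text));
    }

    // Legacy bytes as seen by a program that assumes UTF-8.
    public static string LegacyReadAsUtf8()
    {
        return Encoding.UTF8.GetString(Legacy.GetBytes(Text));
    }

    // Prints code points instead of glyphs to avoid console encoding issues.
    static string CodePoints(string s)
    {
        return string.Join(" ", s.Select(c => "U+" + ((int)c).ToString("X4")));
    }

    static void Main()
    {
        Console.WriteLine(CodePoints(Utf8ReadAsLegacy()));  // U+00C3 U+00A9
        Console.WriteLine(CodePoints(LegacyReadAsUtf8()));  // U+FFFD
    }
}
```

In the first direction one accented letter turns into two unrelated characters; in the second it turns into the replacement character U+FFFD.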

Here are two ways to enable it, which I think will be helpful for many users:

1. Win+R -> intl.cpl
2. Open the Administrative tab.
3. Click the Change system locale button.
4. Enable Beta: Use Unicode UTF-8 for worldwide language support.
5. Reboot.

or alternatively via reg file:

Windows Registry Editor Version 5.00

[HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage]
"ACP"="65001"
"OEMCP"="65001"
"MACCP"="65001"

Nice; also worth mentioning that a reboot is required (if you do it via Control Panel, you'll be prompted). Note that the setting applies to both the _OEM_ code page (as used in _console_ applications; e.g. `437` ([CP437](https://en.wikipedia.org/wiki/Code_page_437)) on US-English systems) and the _ANSI_ code page (as used in GUI-subsystem applications; e.g., `1252` ([Windows-1252](https://en.wikipedia.org/wiki/Windows-1252))). Using this still-in-beta setting has far-reaching consequences, so it may not be for everyone - see [this answer](https://stackoverflow.com/a/57134096/45375). – mklement0, Jun 21 '21 at 20:12

score 6 · Answer 3 · answered Apr 08 '20 at 18:29


Most Windows C APIs come in two different variants:

"A" variant that uses 8-bit strings in whatever the system's configured encoding is. This varies depending on the configured country/language. (Microsoft calls the configured encoding the "ANSI Code Page", but it doesn't really have anything to do with ANSI.)
"W" variant that uses 16-bit strings in a fixed almost-UTF-16 encoding. (The "almost" is because "unpaired surrogates" are allowed; if you don't know what those are, then don't worry about them.)

The official Microsoft advice is not to use the "A" versions, but to ensure your code always uses the "W" variants. That way you're supposed to get consistent behaviour no matter what the user's country/language is configured as.
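
To make the two string forms concrete: .NET strings are sequences of 16-bit code units like the "W" form, while an "A" API on a system whose ANSI code page is 65001 would receive UTF-8 bytes. The following sketch (hypothetical class name, standard `System.Text` calls only) prints both forms of the first character from the question, and shows that a .NET string can hold an unpaired surrogate.

```csharp
using System;
using System.Text;

static class WideVsNarrow
{
    // Formats bytes as hyphen-separated hex, e.g. "D8-3D".
    public static string Hex(byte[] bytes)
    {
        return BitConverter.ToString(bytes);
    }

    static void Main()
    {
        string s = "\U0001F5D5";

        // One code point outside the BMP takes two UTF-16 code units.
        Console.WriteLine(s.Length);                                    // 2
        Console.WriteLine(Hex(Encoding.BigEndianUnicode.GetBytes(s)));  // D8-3D-DD-D5

        // The same code point as UTF-8, the 8-bit form under code page 65001.
        Console.WriteLine(Hex(Encoding.UTF8.GetBytes(s)));              // F0-9F-97-95

        // A lone high surrogate is still a legal .NET string ("almost UTF-16").
        string lone = new string((char)0xD83D, 1);
        Console.WriteLine(char.IsHighSurrogate(lone[0]));               // True
    }
}
```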

However, it looks like that checkbox is doing more than one thing. It's clear it's supposed to change the "ANSI Code Page" to 65001, which means UTF-8. It looks like it's also changing font rendering to be more Unicody.

I suggest you detect if GetACP() == 65001, then draw the Unicode version of your strings, otherwise draw the old "0r" version. I'm not sure how you do that from .NET...
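
From .NET, one way to make that check is to call `GetACP` through P/Invoke. The sketch below is a minimal, Windows-only illustration of the suggestion, not a verified fix for the rendering issue: `CaptionPicker` and its methods are hypothetical names, and the two strings are the ones given in the question.

```csharp
using System;
using System.Runtime.InteropServices;

static class CaptionPicker
{
    // GetACP is exported by kernel32.dll and returns the active ANSI code page.
    [DllImport("kernel32.dll")]
    private static extern uint GetACP();

    public const uint Utf8CodePage = 65001;

    // Picks the text to draw for a given ANSI code page.
    public static string CaptionFor(uint ansiCodePage)
    {
        return ansiCodePage == Utf8CodePage
            ? "\U0001F5D5\U0001F5D9"  // Unicode code points from the question
            : "0r";                   // legacy Webdings codes from the question
    }

    // Windows only: queries the current process's ANSI code page.
    public static string CaptionForCurrentSystem()
    {
        return CaptionFor(GetACP());
    }

    static void Main()
    {
        Console.WriteLine("ANSI code page: {0}", GetACP());
        Console.WriteLine("Caption length: {0}", CaptionForCurrentSystem().Length);
    }
}
```

In the Paint handler from the question, the literal `"0r"` would then be replaced by `CaptionPicker.CaptionForCurrentSystem()`.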


user9876


The "A" variant worked with ANSI C, C89 and C90. The "W" variant did NOT work with ANSI C, and was not compatible with portable ('ANSI') C until the standards of C95 were widely adopted. – david Nov 21 '20 at 03:33
@david: I'm talking about the Windows-specific APIs such as CreateFileA / CreateFileW. They are completely independent of what language you're using, whether that's some revision of C, or C++, or Pascal, or whatever. You're talking about the changes Microsoft proposed to the official C specification to add better wide-char support, which are a different thing (and are part of the C runtime, not the core OS API). – user9876 Dec 05 '20 at 21:09
And I was commenting on your assertion that the ANSI API (available for use with ANSI c) had nothing to do with ANSI. – david Dec 07 '20 at 00:43
https://learn.microsoft.com/en-us/windows/win32/intl/code-pages explains the origin of the "A" suffix: "Windows code pages, commonly called "ANSI code pages", are [encodings] ... Originally, Windows code page 1252, the code page commonly used for English and other Western European languages, was based on an American National Standards Institute (ANSI) draft. ... but Windows code page 1252 was implemented before the standard became final, and is not exactly the same ... The "A" version handles text based on Windows code pages". – user9876 May 05 '21 at 00:34
I.e. the suffix is "A" because Windows supported a bunch of 8-bit encodings, the one commonly used by Microsoft was based on an ANSI draft code page, but most of the others had nothing to do with ANSI. So the entire encoding mechanism got wrongly called "ANSI Code Page" by western programmers, and the name stuck. It's nothing to do with ANSI C. – user9876 May 05 '21 at 00:37
The Windows double byte encodings were implemented at a time that ANSI C was an important name for an important coding standard that indicated a specific and important api -- the ANSI C api. "ANSI C" is the "ANSI" in the Windows ANSI api -- the api specified for ANSI C. – david May 06 '21 at 02:45
Muddy waters... quoting April 2022 Windows App Desktop Design, "Until recently, Windows has emphasized "Unicode" -W variants over -A APIs. However, recent releases have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps. If the ANSI code page is configured for UTF-8, -A APIs typically operate in UTF-8. This model has the benefit of supporting existing code built with -A APIs without any code changes." https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page – buzz3791 May 10 '22 at 21:36

score 1 · Answer 4 · answered Jun 01 '22 at 01:54

On my Windows system, when I checked "Beta: Use Unicode UTF-8 for worldwide language support", the following registry values in HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Control\Nls\CodePage changed:

ACP: 936 -> 65001
MACCP: 10008 -> 65001
OEMCP: 936 -> 65001

If I leave the box unchecked, the Visual Studio compilation fails with Exception: Bad UTF-8 encoding (U+FFFD; REPLACEMENT CHARACTER) found while decoding string: .... If I check it, the compilation succeeds, but the OS is full of unreadable text.
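
One plausible mechanism for an error like this (an assumption, since the failing tool is not identified above) is that some text is produced in code page 936 and then decoded as UTF-8. The sketch below uses a hypothetical class name and a hard-coded pair of bytes, D6 D0, which encode a single Chinese character in code page 936 but are not valid UTF-8.

```csharp
using System;
using System.Text;

static class StrictUtf8Check
{
    // One Chinese character (U+4E2D) as encoded in code page 936: D6 D0.
    public static readonly byte[] Cp936Bytes = { 0xD6, 0xD0 };

    // Returns false if the bytes are not well-formed UTF-8.
    public static bool IsValidUtf8(byte[] bytes)
    {
        // Second argument: throw on invalid bytes instead of substituting U+FFFD.
        var strict = new UTF8Encoding(false, true);
        try
        {
            strict.GetString(bytes);
            return true;
        }
        catch (DecoderFallbackException)
        {
            return false;
        }
    }

    static void Main()
    {
        // The code page 936 bytes are rejected by a strict UTF-8 decoder.
        Console.WriteLine(IsValidUtf8(Cp936Bytes));                               // False

        // A lenient decoder silently substitutes U+FFFD instead.
        Console.WriteLine(Encoding.UTF8.GetString(Cp936Bytes).IndexOf('\uFFFD') >= 0); // True

        // The same character encoded as UTF-8 passes the check.
        Console.WriteLine(IsValidUtf8(Encoding.UTF8.GetBytes("\u4E2D")));         // True
    }
}
```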
