
In Windows 10 and earlier, I was able to convert strings in my local code page 1250 to UTF-8 with the following code, using either 1250 or CP_ACP as the source code page. In Windows 11, this no longer works with CP_ACP (while 1250 still works). It seems that the default code page is now 65001, which cannot be translated to UTF-8 this way. The result is simply wrong.

The reason is probably that my string "Öh" in the example is not validly encoded in code page 65001 (UTF-8). Now I have a big project where the user enters strings and various third-party components play a role, all of which seem to deliver strings in 1250, or in the current code page of a non-European user.

Why is that? And what to do?

#include <Windows.h>

#include <cstdio>

int main()
{
    printf("UTF Conversion Test\n");

    char line[1000];
    WCHAR uline[1000];
    char uline1[1000];

    line[0] = (char)214; // 0xD6, 'Ö' in code page 1250
    line[1] = 104;       // 'h'
    line[2] = 0;         // terminator

    char *s1 = line;
    while (*s1 != 0)
    {
        // Cast via unsigned char so 0xD6 prints as d6 / 214, not sign-extended
        printf("%10x %d\n", (unsigned char)*s1, (unsigned char)*s1);
        s1++;
    }
    printf("\n");

    MultiByteToWideChar(1250, 0, line, -1, uline, 1000);
    // MultiByteToWideChar(CP_ACP, 0, line, -1, uline, 1000);

    WCHAR* s2 = uline;

    while (*s2 != 0)
    {
        printf("%10x %d\n", (int)*s2, (int)*s2);
        s2++;
    }
    printf("\n");

    WideCharToMultiByte(CP_UTF8, 0, uline, -1, uline1, 1000, 0, 0);

    char *s3 = uline1;

    while (*s3 != 0)
    {
        // Cast via unsigned char so UTF-8 bytes print as c3 96, not sign-extended
        printf("%10x %d\n", (unsigned char)*s3, (unsigned char)*s3);
        s3++;
    }
}
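
The suspicion voiced in the question can be checked without any Windows API: the two bytes the test program stores in `line` (0xD6 0x68) are not well-formed UTF-8, because 0xD6 announces a two-byte sequence and 0x68 is not a continuation byte. Below is a minimal sketch of such a check. The name `is_valid_utf8` is invented for this illustration, and the function is a simplified validator, not what `MultiByteToWideChar` does internally.

```cpp
#include <cstddef>

// Returns true if the n bytes at p are well-formed UTF-8.
// Rejects bad lead or continuation bytes, truncated sequences,
// overlong forms, surrogates and values above U+10FFFF.
constexpr bool is_valid_utf8(const unsigned char* p, std::size_t n)
{
    std::size_t i = 0;
    while (i < n) {
        unsigned char b = p[i];
        std::size_t extra = 0;   // continuation bytes expected after b
        unsigned long cp = 0;    // code point decoded so far
        unsigned long min = 0;   // smallest code point allowed for this length

        if (b < 0x80) { i++; continue; }                     // plain ASCII
        else if ((b & 0xE0) == 0xC0) { extra = 1; cp = b & 0x1F; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { extra = 2; cp = b & 0x0F; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { extra = 3; cp = b & 0x07; min = 0x10000; }
        else return false;       // stray continuation byte or invalid lead byte

        if (n - i <= extra) return false;                    // sequence is cut off

        for (std::size_t k = 1; k <= extra; k++) {
            unsigned char c = p[i + k];
            if ((c & 0xC0) != 0x80) return false;            // not 10xxxxxx
            cp = (cp << 6) | (c & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF) return false;         // overlong or out of range
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;      // UTF-16 surrogate
        i += extra + 1;
    }
    return true;
}
```

Called on the bytes `{0xD6, 0x68}` the function returns false, while the UTF-8 encoding of the same text, `{0xC3, 0x96, 0x68}`, passes. That is consistent with the conversion misbehaving as soon as `CP_ACP` resolves to 65001.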
Chuck Walbourn
Rene
  • Does this answer your question? [Is codepage 65001 and utf-8 the same thing?](https://stackoverflow.com/questions/1629437/is-codepage-65001-and-utf-8-the-same-thing) – GSerg Dec 02 '21 at 15:08
  • https://learn.microsoft.com/en-us/windows/apps/design/globalizing/use-utf8-code-page – GSerg Dec 02 '21 at 15:12
  • What does `printf("%d\n", GetACP())` report? – Barmak Shemirani Dec 02 '21 at 16:04
  • `CP_ACP` means "use the local encoding", which varies by localization of Windows. 65001 is UTF-8, and Windows 11 has apparently changed the default (finally). Use `1250` if you know the data is encoded that way. Be explicit. – Mark Tolonen Dec 02 '21 at 17:43
  • Your example `char[]` array is using characters from Windows-1250 specifically, so it doesn't make sense to *ever* use `CP_ACP` to convert such data to UTF-16, since `CP_ACP` is not guaranteed to map to codepage 1250. Using codepage 1250 directly is the correct solution. Use `CP_ACP` only when processing text obtained from the user, i.e. via UI controls operating in ANSI mode (in which case, you really should be using UNICODE mode instead). Codepage 65001 (`CP_UTF8`) is Microsoft's UTF-8 codepage, so no conversion via `MultiByteToWideChar()` is needed if the `char[]` data is UTF-8 to begin with. – Remy Lebeau Dec 02 '21 at 17:46
  • @MarkTolonen I don't have Windows 11, but there is no indication that Windows 11 switched to UTF-8. The asker may have manually changed system settings, or used a UTF-8 manifest for the exe, or more likely there is another issue. – Barmak Shemirani Dec 02 '21 at 18:25
  • @BarmakShemirani I don't have Win11 either, but Win10 had the "Beta: Use Unicode UTF-8 for worldwide language support" option in Region Settings for quite a while. It sounds like the OP's Windows 11 at least has that option enabled. – Mark Tolonen Dec 02 '21 at 21:10
  • According to this image https://websiteforstudents.com/wp-content/uploads/2021/07/windows-11-current-system-locale.png it looks like UTF8 is still beta in Windows 11. – Barmak Shemirani Dec 02 '21 at 21:49
  • I did not enable anything UTF related when switching to Windows 11. It did that by itself. Indeed, `GetACP()` returns 65001, so that is the default codepage on a Windows 11 system. Since my application (Euler Math Toolbox) is too deeply tied to old-style character arrays, even wide chars are too complicated to introduce. It simply uses the user's current codepage in 8-bit mode. Now I have to find a way to continue this. But it is no longer possible to ask the system for a proper 8-bit codepage that works well on the system of the user, or is it? – Rene Dec 03 '21 at 20:27
  • First step is to check the result coming back from both those Win32 APIs. If they return 0, then call `GetLastError()` to see what they reported. – Chuck Walbourn Dec 04 '21 at 06:13
  • Which version of Visual C++ are you using? See [this issue](https://support.microsoft.com/en-us/topic/apps-using-legacy-crts-don-t-work-properly-with-certain-regional-settings-1b7b8caf-1a8f-083a-a864-79ad2493e2a7). – Chuck Walbourn Dec 04 '21 at 20:59
  • Thanks for all the answers. But the issue boils down to the intention of modern Windows versions to use UTF-8 internally and not the local codepage. This is a good idea. For my application, however, it would require a major rewrite of several parts of the code. I would be happy if someone found a way to learn the codepage of the current user of my program. – Rene Dec 05 '21 at 21:51

1 Answer


It turns out that Windows 11 activates the beta support for UTF-8 system-wide by default (at least it did on my system). This means that any program that does not store its strings in Unicode internally has to translate to UTF-8 and back in order to use Windows services such as drawing characters on the screen. Even worse, some of its dialogs may stop showing local characters correctly. One solution is to disable this beta support in the administrative settings under time and region.

Rene
  • It's weird that Windows 11 would use a beta feature as default. I can't find any announcement from Microsoft about this. Another possibility is that you installed some program, and that program secretly changed your system settings to UTF8. Try creating a new user account and check if it's still UTF8. – Barmak Shemirani Dec 07 '21 at 16:26
  • Mind you that adoption of this is very much a good thing in the long run; it will just cause a few issues when converting current applications. It will get rid of a lot of weird issues with actually supporting languages, as motivated here: https://utf8everywhere.org/ – LaPingvino Dec 20 '21 at 19:44
  • Of course, it is the right idea to use Unicode. Java had 16-bit Unicode characters from the start. But Windows always had to fight with backward compatibility, and DOS was strictly 8-bit. Now we have to fight with the shadows of the past. For my program, I was compiling the dialogs with Visual Studio 2019. And I expect a UI to respect old compilations and run them correctly. – Rene Dec 21 '21 at 22:00
  • @Rene While that’s theoretically possible, there have to be few if any programs that depend on the system codepage being set to something other than the previous default. If you somehow do have one, I recommend creating a batch file that runs `chcp 1253` or whatever it needs, before the program. – Davislor Dec 23 '21 at 07:25
  • I've seen Windows 11 systems where this beta feature was enabled by default. But on most Windows 11 systems I came across, it's usually not enabled. – Jabberwocky Jan 19 '23 at 14:11