
I'm looking to move from ASCII to UTF-8 everywhere in my Windows desktop (Win32/MFC) application, as opposed to the usual move to UTF-16. The idea is that fewer changes will be needed, and interfacing with external systems that talk UTF-8 will require less work.

The problem is that the static control and the button in a dialog box loaded from the resource file only ever display the first character of their kanji text. Should resource files work just fine using UTF-8?

[Screenshot: dialog illustrating the problem]

UTF-8 strings appear to be read and displayed correctly coming from the String Table in the resource file, but not text directly on dialogs themselves.

I am testing using kanji characters.

[Screenshot: how the dialog appears in the resource editor]

I have added #pragma code_page(65001) to the resource file and a UTF-8 manifest to the application.

Using UTF-8 everywhere means using std::string, CStringA and, implicitly, the -A Win32 functions, by setting "Advanced/Character Set" to "Not Set". Additionally, the resource file is in UTF-8, including dialogs with their text, String Tables, etc. If I set it to "Use Unicode Character Set", my understanding is that UTF-16 and the -W functions become the default everywhere, which has historically been the standard Windows way of supporting Unicode.

The pragma appears to work, as the Resource Editor in Visual Studio does not clobber the .rc file into UTF-16LE. Also, the manifest appears to work as the MessageBox() (MessageBoxA) function displays text from the String Table correctly. Without the manifest, the MessageBox() displays question marks.

      TCHAR buffer[512];
      LoadString(hInst, IDS_TESTKANJI, buffer, 512 - 1);
      MessageBox(hWnd, buffer, _T("Caption"), MB_OK);

[Screenshot: successful message box]

If I set the Character Set to "Use Unicode Character Set", everything appears to work as expected and all characters are displayed.

[Screenshot: dialog successfully showing kanji]

My suspicion is that the encoding goes UTF-8 (.rc file) -> UTF-16 (internal representation) -> ASCII (dialog text loading?), and that the loader meets a null character from the UTF-16 representation and stops after reading the first character.
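One way to probe the byte-level premise of this suspicion is to lay a string out as UTF-16LE bytes and count how many a narrow, null-terminated reader would consume. The following is only a sketch: both helper names are hypothetical, and it says nothing about what the dialog loader actually does internally.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: lays out UTF-16 code units as little-endian bytes.
std::vector<unsigned char> ToUtf16LeBytes(const std::u16string& text)
{
    std::vector<unsigned char> bytes;
    bytes.reserve(text.size() * 2);
    for (char16_t unit : text) {
        bytes.push_back(static_cast<unsigned char>(unit & 0xFF)); // low byte first
        bytes.push_back(static_cast<unsigned char>(unit >> 8));   // then high byte
    }
    assert(bytes.size() == text.size() * 2);
    return bytes;
}

// Hypothetical helper: number of bytes a narrow reader would consume
// before it hits a zero byte (or the end of the data).
std::size_t NarrowLengthOfUtf16Le(const std::u16string& text)
{
    const std::vector<unsigned char> bytes = ToUtf16LeBytes(text);
    std::size_t n = 0;
    while (n < bytes.size() && bytes[n] != 0) {
        ++n;
    }
    return n;
}
```

For ASCII text such as u"AB" the second byte is zero, so a narrow reader would stop after one byte. For kanji such as U+6F22 and U+5B57 neither byte of a code unit is zero, so a zero byte alone would not explain truncation after the first kanji.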

If I call SetDlgItemText() on my static control using text from the String Table, the static control will show all the characters correctly:

      case WM_COMMAND:
         if (LOWORD(wParam) == IDOK)
         {
            TCHAR buffer[512];
            LoadString(hInst, IDS_TESTKANJI, buffer, 512 - 1);
            SetDlgItemText(hDlg, IDC_STATIC, buffer);
            ...

  • Windows OS Build: 19044.2130
  • Visual Studio 2022 17.4.2
  • Windows SDK Version: 10.0.22621.0
  • Platform Toolset: Visual Studio 2022 (v143)
id48jkdl
  • The resource editor may silently decide to remove that line `#pragma code_page(65001)`, check if it's still there. I wouldn't pay attention to that "UTF-8 everywhere" crap. It's nonsense. Windows UTF-8 features are experimental, it's not a real option. You can use UTF8 for file storage, networking, etc. Otherwise stick to UTF-16 or use .net or something else. – Barmak Shemirani Dec 12 '22 at 18:38
  • @BarmakShemirani Good question, but the pragma is left untouched. A lot of posts/documentation related to this area are from the distant past, so I wasn't sure if enough usable support had finally arrived recently. Even the current Microsoft documentation in relation to the UTF-8 code pages says: _"...you **might** need to convert UTF-8 data to UTF-16 (or vice versa) to interoperate with Windows APIs."_ - which APIs?! – id48jkdl Dec 12 '22 at 19:50
  • Also check if there is something like `#pragma code_page(1252)` further down in your *.rc file. If you find it, comment it out and replace with `#pragma code_page(65001)`. Double check the unicode characters to make sure the resource editor didn't change them. – Barmak Shemirani Dec 12 '22 at 20:37
  • About your second question, there are a few newer APIs such as `SendMessageW(hcombo, CB_SETCUEBANNER, ...)` (for setting the cue/hint in combobox) which don't have ANSI version. You must convert UTF8 to UTF16. There will be an additional problem in MFC, because `CComboBox::SetCueBanner` is not even available in ANSI, you must convert to UTF16 and use `SendMessageW` directly. – Barmak Shemirani Dec 12 '22 at 20:40
  • @BarmakShemirani No additional pragmas found. Indeed, I inserted one just before the dialog definition in the .rc file, and the characters became garbage with any "Character Set" setting. I think the .rc file contents are actually okay based on the ability of the String Table to deliver the desirable results. – id48jkdl Dec 12 '22 at 21:09
  • @BarmakShemirani That's most interesting in relation to CB_SETCUEBANNER/SetCueBanner. I can't see this unusual characteristic documented for either on their respective documentation pages, but I did find this: [Deprecated ANSI APIs](https://learn.microsoft.com/en-us/cpp/mfc/deprecated-ansi-apis?view=msvc-170). So ANSI APIs are being deprecated, but they're also being used for UTF-8 support? – id48jkdl Dec 12 '22 at 21:20
  • I haven't seen that list before, it makes no sense to me. You can ask a separate question about it, or maybe it's been discussed before. `Set/GetCueBanner` is not ANSI and should not be marked as deprecated. I don't know why `GetIdealSize` is marked as deprecated. *"ANSI character set"* is of course deprecated. But using ANSI APIs for UTF8 is forward compatible. Microsoft can one day fix the UTF thing and release it, or make a clear statement about it. – Barmak Shemirani Dec 12 '22 at 22:36

1 Answer


It seems that the current answer to displaying UTF-8 text on dialogs is to set the text manually in code, using a function like SetDlgItemText() with the UTF-8 string, rather than relying on the resource loading done by the dialog creation code itself. With the UTF-8 manifest in place, the -A functions are called, and they set the UTF-8 text just fine.

You can also call a -W function explicitly, converting UTF-8 -> UTF-16 before the call. See "UTF-8 text in MFC application that uses Multibyte character set".
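As a rough illustration of the UTF-8 -> UTF-16 step, here is a minimal sketch in portable C++. The helper name Utf8ToUtf16 is hypothetical and not taken from the linked question. It throws on malformed input but, as a simplification, does not reject overlong encodings. In a Win32 build, MultiByteToWideChar() with CP_UTF8 performs the same conversion and is the more usual choice.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-16 code units.
// Throws std::invalid_argument on malformed input. Overlong encodings
// are NOT rejected in this sketch.
std::u16string Utf8ToUtf16(const std::string& utf8)
{
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (extra > utf8.size() - i - 1) {
            throw std::invalid_argument("truncated UTF-8 sequence");
        }
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(utf8[i + k]);
            if ((b & 0xC0) != 0x80) {
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            }
            cp = (cp << 6) | (b & 0x3F);
        }
        i += extra + 1;

        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw std::invalid_argument("invalid Unicode code point");
        }
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            assert(cp <= 0x10FFFF);
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

On Windows, where wchar_t is 16 bits wide, the result could be copied into a std::wstring (for example std::wstring(s.begin(), s.end())) and passed to a -W function such as SetDlgItemTextW().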

See also Microsoft's documentation for the CreateDialogIndirectA macro (winuser.h), which is unusually explicit on this point: "All character strings in the dialog box template, such as titles for the dialog box and buttons, must be Unicode strings."

id48jkdl