1

I am using mbstowcs() to convert a UTF-8 encoded char* string to wchar_t*, which is then passed to _wfopen(). However, _wfopen() always returns a NULL pointer, and I have traced the problem to the result of mbstowcs().

I prepared the following example and used printf for debugging...

size_t out_size;
int requiredSize;
wchar_t *wc_filename;
char *utf8_filename = "C:/Users/xxxxxxxx/Desktop/\xce\xb1\xce\xb2\xce\xb3.stdf";
wchar_t *expected_output = L"C:/Users/xxxxxxxx/Desktop/αβγ.stdf";

printf("input: %s, length: %d\n", utf8_filename, (int)strlen(utf8_filename));
printf("correct out length is %d\n", (int)wcslen(expected_output));

// conversion starts here
setlocale(LC_ALL, "C.UTF-8");

requiredSize = mbstowcs(NULL, utf8_filename, 0);
wc_filename = (wchar_t*)malloc( (requiredSize+1) * sizeof(wchar_t));

printf("requiredsize: %d\n", requiredSize);

if (!wc_filename) {
    // allocation failed, nothing to free
    return -1;
}
out_size = mbstowcs(wc_filename, utf8_filename, requiredSize + 1);
if (out_size == (size_t)(-1)) {
    // conversion failed
    free(wc_filename);
    return -1;
}
printf("out_size: %d, wchar name: %ls\n", (int)out_size, wc_filename);

if (wcscmp (wc_filename, expected_output) != 0) {
    printf("converted result is not correct\n");
}
free(wc_filename);

And the console output is:

input: C:/Users/xxxxxxxx/Desktop/αβγ.stdf, length: 37
correct out length is 34
requiredsize: 37
out_size: 37, wchar name: C:/Users/xxxxxxxx/Desktop/αβγ.stdf
converted result is not correct

Why do expected_output and wc_filename print the same content but have different lengths? What did I do wrong here?

Adrian Mole
nochenon

2 Answers

2

The problem appears to be in your choice of locale name. Replacing the following:

setlocale(LC_ALL, "C.UTF-8");

with this:

setlocale(LC_ALL, "en_US.UTF-8");

fixes the issue on my system (Windows 10, MSVC, 64-bit build) – at least, the out_size and requiredSize are both 34 and the "converted result is not correct\n" message doesn't show. Using "en_GB.UTF-8" also worked.
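
A rough way to see where 37 and 34 come from, assuming the failed setlocale call left the program in the default "C" locale, where the conversion appears to widen each byte to one wide character. The file name is 37 bytes long, but the three Greek letters take two bytes each in UTF-8, so it holds only 34 code points. The sketch below counts both; the helper name utf8_code_points is hypothetical and it assumes valid UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: counts UTF-8 code points by skipping continuation
   bytes (those of the form 10xxxxxx). Assumes the input is valid UTF-8. */
static size_t utf8_code_points(const char *s)
{
    size_t count = 0;

    assert(s != NULL);
    for (; *s != '\0'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}

/* The file name from the question. */
static const char question_name[] =
    "C:/Users/xxxxxxxx/Desktop/\xce\xb1\xce\xb2\xce\xb3.stdf";
```

Here strlen(question_name) gives the byte count that the byte-per-character conversion reported, while utf8_code_points(question_name) gives the length a UTF-8 aware conversion should produce.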

I'm not sure if the C Standard actually defines what locale names are, but this question/answer may be helpful: Valid Locale Names.


Note: As mentioned in the comment by Mgetz, using setlocale(LC_ALL, ".UTF-8"); also works – I guess that would be the minimal and most portable locale name to use.

Second note: You can check whether the setlocale call succeeded by comparing its return value to NULL. With your original locale name, the following code prints an error message (but not if you remove the leading "C"):

    if (setlocale(LC_ALL, "C.UTF-8") == NULL) {
        printf("Error setting locale!\n");
    }
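
Building on that check, here is a minimal sketch that tries several locale names in turn and only converts once one of them is accepted. The candidate list and the function names select_utf8_locale and mb_to_wide are assumptions for illustration; which names are valid is implementation-defined, so the list may need adjusting for your runtime.

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <wchar.h>

/* Tries each candidate name and returns the string reported by setlocale()
   for the first one accepted, or NULL if none is available.
   The candidate list is an assumption, not an authoritative set. */
static const char *select_utf8_locale(void)
{
    static const char *const candidates[] = {
        ".UTF-8", "en_US.UTF-8", "C.UTF-8"
    };
    size_t i;

    for (i = 0; i < sizeof candidates / sizeof candidates[0]; ++i) {
        const char *name = setlocale(LC_ALL, candidates[i]);
        if (name != NULL) {
            return name;
        }
    }
    return NULL;
}

/* Converts a multibyte string to a newly allocated wide string using the
   current locale. Returns NULL on invalid input or allocation failure;
   the caller frees the result. */
static wchar_t *mb_to_wide(const char *src)
{
    size_t needed, written;
    wchar_t *dst;

    assert(src != NULL);
    needed = mbstowcs(NULL, src, 0);
    if (needed == (size_t)-1) {
        return NULL;            /* invalid sequence for this locale */
    }
    dst = malloc((needed + 1) * sizeof *dst);
    if (dst == NULL) {
        return NULL;
    }
    written = mbstowcs(dst, src, needed + 1);
    if (written == (size_t)-1) {
        free(dst);
        return NULL;
    }
    return dst;
}
```

Printing the string returned by select_utf8_locale() shows which locale was actually selected, which can help when none of the names behave as expected.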
Adrian Mole
  • On windows all they need is [`".UTF-8"`](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setlocale-wsetlocale?view=msvc-160) unless they want something other than character set changed. – Mgetz Aug 19 '21 at 13:41
  • @Mgetz Excellent - thanks! I have edited your suggestion into my answer. – Adrian Mole Aug 19 '21 at 13:44
  • Was literally dealing with this myself this morning... irony – Mgetz Aug 19 '21 at 13:44
  • @Mgetz Locales and C/C++ handling of UTF codes are both weird and wonderful. :-) – Adrian Mole Aug 19 '21 at 13:45
  • @mgetz: and if you use `""`? That certainly works on Unix (assuming the locale is set to utf-8 in the environment, which should be the case if the terminal correctly shows utf-8 streams). – rici Aug 19 '21 at 14:54
  • There are only two locale strings defined by the C standard: `"C"` and `""`: «A value of "C" for locale specifies the minimal environment for C translation; a value of "" for locale specifies the locale-specific native environment. Other implementation-defined strings may be passed as the second argument to setlocale.» (7.11.1.1/3) (To be extra clear, `"C"` refers only to the complete string containing only the letter `C`, not any string starting `C.`.) – rici Aug 19 '21 at 15:02
  • @rici *If locale points to an empty string, the locale is the implementation-defined native environment.* per [msdn](https://learn.microsoft.com/en-us/cpp/c-runtime-library/locale-names-languages-and-country-region-strings?view=msvc-160) – Mgetz Aug 19 '21 at 15:03
  • @mgetz: yes. That's what the C standard says, too. The question is, what is the MB encoding in that native environment? In Unix, it's taken from the environment variables set by the user (euphemistically speaking) which these days is almost always UTF-8. – rici Aug 19 '21 at 15:05
  • @rici whatever the current user has set as their ANSI code page I'd assume. As best I can tell it's whatever is returned from [`GetUserDefaultLocaleName`](https://learn.microsoft.com/en-us/windows/win32/api/winnls/nf-winnls-getuserdefaultlocalename). Also worth noting: *A locale argument value of C specifies the minimal ANSI conforming environment for C translation. The C locale assumes that every char data type is 1 byte and its value is always less than 256. If locale points to an empty string, the locale is the implementation-defined native environment.* – Mgetz Aug 19 '21 at 15:08
  • Since the terminal emulator renders graphemes based on this same MB encoding, if UTF-8 streams render correctly, then the default environment string is correct. Note that this is *not* the same as the locale previous to a call to setlocale; the locale at startup is always `"C"`. – rici Aug 19 '21 at 15:10
  • @rici While this discussion on (default) locales is both interesting and useful, I think that editing its 'essence' into my answer would be stepping into the territory already covered by the post I already linked (and links given therein). – Adrian Mole Aug 19 '21 at 15:11
  • @AdrianMole: in part, it's an attempt to help you overcome your uncertainties, re: "I'm not sure if the C Standard actually defines what locale names are" and "I guess...". :-) – rici Aug 19 '21 at 15:23
  • In Unix, at least, the best advice is to use the minimal string `""` if your goal is to produce correctly rendered console output. – rici Aug 19 '21 at 15:26
  • Thanks so much for the info. Unfortunately, setlocale() only succeeds for "" and "C", and the output is still wrong. For others like "en_US.UTF-8" and ".UTF-8", setlocale() always returns NULL. My OS build is 19042.985 and I am using mingw64. – nochenon Aug 20 '21 at 04:01
  • @nochenon: When you call `setlocale(LC_ALL, "")`, what is the value of the returned string? (If it is not NULL, it should be a printable string, and it will tell you something about your system's locale configuration. So printing it out is sometimes a useful debugging technique.) – rici Aug 20 '21 at 15:11
  • @rici It says "English_United States.1252". Is that the reason ".utf-8" is not working? – nochenon Aug 23 '21 at 06:17
  • @nochenon: [CP 1252](https://en.wikipedia.org/wiki/Windows-1252) is a single-byte encoding similar to ISO-8859-1, sometimes called Latin-1. If that's what your locale is set to, then mbstowcs will assume that the input string is in that encoding. That makes the mbstowcs conversion pretty simple, because the Unicode code point for most CP-1252 characters is the same code; in other words, conversion only requires taking the byte and making it a short. However, I don't believe that your terminal is actually displaying CP-1252; it looks to me like it's displaying UTF-8,... – rici Aug 23 '21 at 07:10
  • ... because of the three extra bytes which are reported. I guess that's all an artefact of Mingw64; I'm afraid I don't know much about how to configure locales on Mingw64 because it's not something I've ever had to do, and a quick internet search only found me people saying that they had problems. So I don't really know what to say, sorry. But I'm guessing that it's an issue with your mingw64 install and not with your windows-10 install. (Although that is just a guess.) You need to find a UTF-8 locale to get all that stuff to work properly. – rici Aug 23 '21 at 07:12
  • Thanks @rici, I have one more question: is there any function I can call to get a list of valid locale names that I can use in setlocale()? Google is not helping. – nochenon Aug 23 '21 at 08:18
  • @nochenon: I wish I could help but I really don't know and I don't have a Windows machine to try anything. There are Windows-specific functions which will provide information like that. Also, there is a Windows-specific interface for converting UTF-8 to wchar_t, which probably will work on your system. (Of course, it won't work on Unix :-) ) – rici Aug 24 '21 at 01:29
  • Thanks, I guess I have to look at the Python repo to see what tricks they've used to support Unicode across all platforms. ;) – nochenon Aug 24 '21 at 07:19
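
The comments above mention a Windows-specific interface for converting UTF-8 to wchar_t without naming it. One such interface is the Win32 function MultiByteToWideChar with the CP_UTF8 code page; that this is the one meant is an assumption. The sketch below wraps it in a hypothetical helper, utf8_to_wide, and falls back to the native locale plus mbstowcs on other systems, as suggested for Unix.

```c
#include <assert.h>
#include <stdlib.h>
#include <wchar.h>

#ifdef _WIN32
#include <windows.h>

/* Converts a NUL-terminated UTF-8 string to a newly allocated wide string
   without consulting the C locale. Returns NULL on invalid UTF-8 or
   allocation failure; the caller frees the result. */
static wchar_t *utf8_to_wide(const char *utf8)
{
    int needed;
    wchar_t *wide;

    assert(utf8 != NULL);
    /* With a length of -1 the returned count includes the terminating NUL. */
    needed = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                 utf8, -1, NULL, 0);
    if (needed <= 0) {
        return NULL;
    }
    wide = malloc((size_t)needed * sizeof *wide);
    if (wide == NULL) {
        return NULL;
    }
    if (MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                            utf8, -1, wide, needed) == 0) {
        free(wide);
        return NULL;
    }
    return wide;
}
#else
#include <locale.h>

/* Non-Windows fallback: assumes the environment's locale is UTF-8.
   Note that it changes the process locale as a side effect. */
static wchar_t *utf8_to_wide(const char *utf8)
{
    size_t needed;
    wchar_t *wide;

    assert(utf8 != NULL);
    setlocale(LC_ALL, "");
    needed = mbstowcs(NULL, utf8, 0);
    if (needed == (size_t)-1) {
        return NULL;
    }
    wide = malloc((needed + 1) * sizeof *wide);
    if (wide == NULL) {
        return NULL;
    }
    mbstowcs(wide, utf8, needed + 1);
    return wide;
}
#endif
```

On Windows the result can be passed straight to _wfopen() and released with free() afterwards. Because the Windows branch never touches setlocale, it should behave the same whichever C runtime the program is linked against.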
0

The Universal CRT (UCRT) supports UTF-8 locales, but MSVCRT.DLL does not. When using MinGW, you need to link against UCRT.

  • The answer could be more comprehensive with additional instructions; please consider adding them. – Sadra Mar 07 '22 at 04:54