Why can printf display non-ASCII characters when the "C" locale is used?

Question

Note: I'm asking about implementation-defined behavior in Microsoft Visual C++ 2008 (possibly the same in 2005+). OS: simplified Chinese installation of Win7.

I was surprised by what happens when I perform non-ASCII I/O with printf. For example:

   #include <locale.h>
   #include <stdio.h>
   #include <stdlib.h>

   int main(void)
   {
       // This won't be necessary as it's the system default code page.
       //system("chcp 936");

       // NULL queries the current locale, which is "C" at startup
       printf("%s\n", setlocale(LC_ALL, NULL));
       printf("中\n");
       printf("%s\n", setlocale(LC_ALL, "English"));
       printf("中\n");
       return 0;
   }

Output:

Active code page: 936
C
中
English_United States.1252
?D

The memory view in the debugger shows that "中" is encoded as two bytes, 0xD6 0xD0, which is the encoding of that character in code page 936 (simplified Chinese). Those bytes should fall outside the range of the "C" locale, which is most likely 0x00 to 0x7F.

Question:

Why can the character still be displayed correctly in the "C" locale? My first guess was that the locale has no bearing on printf. But then why can't the character be displayed after switching to the "English" locale, which is also different from 936?

Edit:

I redirected standard output to a file and ran some tests. Whatever locale is set, the correct character "中" is saved in the file. This suggests that setlocale() affects the way the console displays the character, which contradicts my understanding of how it works: printf puts the bytes/code points into the console's input buffer, and the console interprets these bytes using its own code page (what chcp returns).

I suppose the C locale on your system is Unicode (UTF-8 or the alike), while the English locale covers ASCII. — , May 05 '13 at 09:24
I think not. How can a GB2312-encoded character be decoded using UTF-8? Btw, in Microsoft's world, I think English locale(ANSI) is a superset of the "C" locale(ASCII) they call. — Eric Z, May 05 '13 at 11:02
It can't be Unicode(at least UTF-16), as `wprintf(L"中")` doesn't display the correct character. In VC2008(maybe 2005+), wide characters are encoded as UCS2/UTF-16. But I agree w/ you that conversion may kinda exist. I'll update my question. — Eric Z, May 05 '13 at 11:41

score 3 · Answer 1 · answered May 06 '13 at 08:36

936 is a rather tricky code page: it allows two-byte characters (similar to what UTF-8 does). Cyrillic (866), for example, doesn't allow two-byte characters, and its behavior will be the same as "English".

So when you use the default (936) code page, it knows how to process a two-byte character, while "English" deals with 0x0 ~ 0x7f only.

Let me also answer why wprintf(L"中") fails. There is a big difference between a console application and a Windows GUI application: they use different code pages. The following table matches console code pages to Windows ones:

DOS   |   Windows
------+----------
850   |  1252
936   | 54936
866   |  1251

So if you would like to see the correct symbols in the console, call WideCharToMultiByte first. That performs the conversion the console needs to work in 936.

Dewfy

Thanks. I can understand why the `"English"` locale doesn't work, sort of. But my main concern is why the `"C"` locale works. The `"C"` locale shouldn't be any UTF or 936, hopefully. Why does it work while "English" does not? – Eric Z May 06 '13 at 08:41
@EricZ - before I answer: UTF is not possible as console output. OK, just imagine the following pseudo-code in the printf implementation: ` if(consume_two_bytes(*str)){ console_put2(*str, *(str+1)); }`. So `NULL` is mapped to the 936 handler, and the 936 handler knows that there are 2-byte characters that have to be processed as a single character. (UTF-8 is handled in a similar way.) But the "English" handler does no checking for 2-byte chars – Dewfy May 06 '13 at 08:52
NULL is not actually null. It stands for the default "C" locale instead. So it should have been treated the same way as "English", because neither should be able to handle Chinese characters. The fact is that printf somehow translates the code points to something else when the locale is "English", which causes the console to decode them incorrectly. I suspect it's a bug, because all printf should do is put bytes into the console buffer. Since the console code page is already the same as the encoding of the char* literals, setlocale shouldn't have any impact on printf. I just can't find any official doc/bug report out there. – Eric Z May 07 '13 at 00:56
@EricZ from the doc: *"The function can also be used to retrieve the current locale's name by passing NULL as the value for locale."* It means that NULL is not "English" (850), but the one set for your local Windows version (obviously 936). So there is no bug in printf; it is documented behavior. – Dewfy May 07 '13 at 07:45
The quoted statement is right, but your conclusion is wrong, unfortunately ;) Passing NULL does return the current locale, but that is "C" by default for both C and C++. It's NOT a locale related to the system ANSI code page (936 in this case)! As to whether it's a bug or not, that's the part I'm NOT sure about. – Eric Z May 07 '13 at 08:48
Exactly, 936 is not ANSI; it is an old-school DOS code page. And actually your `chcp` command is also relevant to the DOS console. – Dewfy May 07 '13 at 08:56

score 3 · Answer 2 · answered May 07 '13 at 17:05

The fact that the C locale prints out the string exactly as given is not surprising. That's what I would expect. What is surprising is that the English locale would do something different.

According to the locale documentation on MSDN, the only effect that locale should have on printf is in determining the radix character for numeric values (i.e. the decimal point).

I suspect perhaps that it's a bug in Microsoft's Compiler. Or at the very least it's undocumented behaviour.

For what it's worth, on my compiler (Borland) the locale has no effect on the output of those strings. It does affect the radix though.

Agreed. Actually locales other than "C" have a weird/unexpected impact on printf. — Eric Z, May 07 '13 at 23:37

Eric Z · Accepted Answer · 2013-05-17T00:37:31.290

OK. For the default "C" locale, the CRT assumes that characters passed to printf don't need any conversion. That is reasonable, because ASCII characters almost always fall into the basic character set of the execution system (shared among the different Windows code pages). When the locale is switched to "English", the CRT assumes the input is encoded in code page 1252 and tries to convert it from "English" to "Chinese", the locale used by the console. But the CRT cannot find the character 中 in code page 1252, which is why it outputs a question mark.

When redirected to a file, CRT knows it and won't do the conversion, because the console code page is no longer used. It just passes through the bytes as-is. How those bytes are interpreted is up to the program you use(e.g., care about BOM or not) when you open the file.

Refer to this MSDN forum link: Why printf can display non-ASCII characters when “C” locale is used?
