
From reading docs in either MSDN or the n1256 committee draft, I was under the impression that a char would always be exactly CHAR_BIT bits as defined in <limits.h>. If CHAR_BIT is set to 8, then a byte is 8 bits long, and so is a char.

Test code

Given the following C code:

#include <stdio.h>
#include <limits.h>

int main(int argc, char **argv) {
    int length = 0;
    while (argv[1][length] != '\0') {
        // print the character, its hexa value, and its size
        printf("char %u: %c\tvalue: 0x%X\t sizeof char: %u\n",
                length,
                argv[1][length],
                argv[1][length],
                sizeof argv[1][length]);
        length++;
    }
    printf("\nTotal length: %u\n", length);
    printf("Actual char size: %u\n", CHAR_BIT);
     
    return 0;
}

I was unsure what the behaviour would be, given arguments that include non-ASCII chars, like ç and à.

Those chars are presumably UTF-8 encoded, i.e. written as multiple bytes each. I would expect them to be processed as individual bytes, meaning ça has a length of 3 for example (4 if counting the \0), and when printing, I'd get one line per byte, so 3 lines instead of 2 (2 being the actual Latin character count).

Output

$ gcc --std=c99 -o program.exe win32.c
$ program.exe test_çà
char 0: t       value: 0x74      sizeof char: 1
char 1: e       value: 0x65      sizeof char: 1
char 2: s       value: 0x73      sizeof char: 1
char 3: t       value: 0x74      sizeof char: 1
char 4: _       value: 0x5F      sizeof char: 1
char 5: τ       value: 0xFFFFFFE7        sizeof char: 1
char 6: α       value: 0xFFFFFFE0        sizeof char: 1

Total length: 7
Actual char size: 8

Question

What is probably happening under the hood is that char **argv is turned into int **argv. This would explain why chars 5 and 6 print a hexadecimal value spanning 4 bytes.

  1. Is that what actually happens?
  2. Is it standard behaviour?
  3. Why are chars 5 and 6 not what was given as input?
  4. CHAR_BIT == 8, sizeof(a_char) == 1, and yet some_char == 0xFFFFFFE7. This seems counter-intuitive. What's happening?

Environment

  • Windows 10
  • Terminal: Alacritty and Windows default cmd (tried in both just in case)
  • GCC under Mingw-w64
  • 7
    You are misinterpreting what your program is doing and the results. `printf` is printing the `char` promoted to `int`. Your characters are above `127` ASCII so are interpreted as negatives, then sign extended to negative `int`s. Then you are using `%x` to print them, and getting the hexadecimal 2's complement representation of these. Unrelated to anything else – Eugene Sh. Jan 27 '22 at 17:16
  • 3
    So you are right that `char` is becoming `int` at some point. But that point is in the `printf` invocation. Since it is a variadic function, it's arguments are undergoing *default promotion* (for `char` it will be `int`). – Eugene Sh. Jan 27 '22 at 17:24
  • 1
    There are also some bugs in the code. (1) `0x%X` should be `0x%zX` since the corresponding argument has type `size_t`. (2) `printf("\nTotal length: %u\n", argv[1][length], length);` has too many arguments. – Ian Abbott Jan 27 '22 at 17:28
  • @EugeneSh. I think I get what you mean. In this case, I suppose I have to interact with the windows API to get back my actual characters in a different form of encoding. – Valentin O. Jan 27 '22 at 17:30
  • 2
    @ValentinO. You might want to look at https://stackoverflow.com/questions/4101864/windows-unicode-commandline-argv – Eugene Sh. Jan 27 '22 at 17:31
  • @IanAbbott For (1) I will look into it to understand what this is. For (2) I will correct my question, I was formatting and removing redundant dumps to make the code as simple as possible. Thanks for the heads-up. – Valentin O. Jan 27 '22 at 17:32
  • 1
    @ValentinO. Actually, I got that wrong. The `0x%X` is OK, but `sizeof char: %u` should be `sizeof char: %zu`. And this assumes you are using MingW's replacement formatted standard I/O routines, and not the ones from msvcrt.dll. – Ian Abbott Jan 27 '22 at 17:35
  • 1) No. 2) Not applicable. – ikegami Jan 27 '22 at 17:41
  • 3) The command line is being fetched from the system using a call that provide data encoded coded according to the Active Code Page. This would be CP1252 for a en-us install of Windows. "ç" is E7 in CP1252. Use `GetCommandLineW` + `CommandLineToArgvW` to get the command line in UTF-16le. – ikegami Jan 27 '22 at 17:45
  • 4) Any integers smaller than an int or unsigned int are promoted to one of these when passed to a vararg function like `printf`. This means the char with a value of -25 gets passed as an `int` with a value of -25. The issue is that `printf "%u"` expects an `unsigned int`. Using `%hhu` should do the trick, though it might not be the "correct" approach. – ikegami Jan 27 '22 at 17:45

2 Answers


No, it's not received as an array of int.

But it's not far from the truth: printf is indeed receiving the char as an int.

When passing an integer type smaller than an int to a vararg function like printf, it gets promoted to an int. On your system, char is a signed type.[1] Given a char with a value of -25, an int with a value of -25 was passed to printf. %u expects an unsigned int, so it treats the int with a value of -25 as an unsigned int, printing 0xFFFFFFE7.

A simple fix:

printf("%X\n", (unsigned char)c);   // 74 65 73 74 5F E7 E0

But why did you get E7 and E0 in the first place?

Each Windows system call that deals with text has two versions:

  • An "ANSI" (A) version that deals with text encoded using the system's Active Code Page.[2] For en-us installs of Windows, this is cp1252.
  • And a Wide (W) version that deals with text encoded using UTF-16le.

The command line is being obtained from the system using GetCommandLineA, the A version of GetCommandLine. Your system uses cp1252 as its ACP. Encoded using cp1252, ç is E7, and à is E0.

GetCommandLineW will provide the command line as UTF-16le, and CommandLineToArgvW will parse it.


Finally, why did E7 and E0 show as τ and α?

The terminal's encoding is different from the ACP! On your machine, it appears to be cp437. (This can be changed.) Encoded using cp437, τ is E7, and α is E0.

Issuing chcp 1252 will set that terminal's encoding to cp1252, matching the ACP. (UTF-8 is 65001.)

You can query the terminal's encoding using GetConsoleCP (for input) and GetConsoleOutputCP (for output). Yeah, apparently they can be different? I don't know how that would happen in practice.


  1. It's up to the compiler whether char is a signed or unsigned type.
  2. This can be changed on a per program basis since Windows 10, Version 1903 (May 2019 Update).
ikegami
  • Good summary! You seem to have first hand experience with these painful issues. It is funny how Microsoft can change the user interface of its OS and products and force users to re-learn the new UI many times over, and on the other hand never ditches bad choices in the system APIs: code pages, 16-bit wchar encodings, CR/LF, clumsy long filenames, 32-bit LONG type... the list goes on. – chqrlie Jan 27 '22 at 18:53
  • @chqrlie, Unix has customizable encodings too. It just calls them locales instead of code pages. The diff is that you have to deal with multiple encodings at once with Windows. This is awful. The change I mentioned in footnote 1 makes this a whole lot better. // Granted, using a variable-length encoding (UTF-16) while providing no facilities to work with it as a variable-length encoding is awful. // I don't know of any program that doesn't accept just LF // You can signal your ability to handle long paths safely using the manifest. // If you want a 64-bit type, use `uint64_t` Windows or not. – ikegami Jan 27 '22 at 19:10
  • @chqrlie, Most of the encoding problems are really due to unix problems being foisted upon Windows. (To wit, C's unix-centric standard library.) If you just use the `W` functions, you only have to deal with one encoding (UTF-16le). That's pretty much been the case since Win95. It's unix that requires dealing with system-specific encodings. – ikegami Jan 27 '22 at 19:14
  • This is very well explained, thank you. As other people have mentioned, I should use the windows API or a windows custom main signature to get back the input in a predictable encoding. – Valentin O. Jan 27 '22 at 19:28

From your code and the output on your system, it appears that:

  • type char is indeed 8 bits wide. Its size is 1 by definition. char **argv is a pointer to an array of pointers to C strings, which are null-terminated arrays of char (8-bit bytes).
  • the char type is signed for your compiler configuration, hence the output 0xFFFFFFE7 and 0xFFFFFFE0 for values beyond 127. char values are passed as int to printf, which interprets the value as unsigned for the %X conversion. The behavior is technically undefined, but in practice negative values are offset by 2^32 when used as unsigned. You can configure gcc to make the char type unsigned by default with -funsigned-char, a safer choice that is also more consistent with the C library behavior.
  • the two non-ASCII characters çà are encoded as single bytes E7 and E0, which correspond to Microsoft's proprietary encoding, their code page Windows-1252, not UTF-8 as you assumed.

The situation is ultimately confusing: the command line argument is passed to the program encoded with the Windows-1252 code page, but the terminal uses the old MS/DOS code page 437 for compatibility with historic stuff. Hence your program outputs the bytes it receives as command line arguments, but the terminal shows the corresponding characters from CP437, namely τ and α.

Microsoft made historic decisions regarding the encoding of non-ASCII characters that seem obsolete by today's standards; it is a shame they seem stuck with cumbersome choices that other vendors have steered away from for good reasons. Programming in C in this environment is a rough road.

UTF-8 was invented in September of 1992 by Unix pioneers Ken Thompson and Rob Pike. They implemented it in Plan 9 overnight, as it had a number of interesting properties for compatibility with C language character strings. Microsoft had already invested millions in their own system and ignored this simpler approach, which has become ubiquitous on the web today.

chqrlie