From reading the docs, in either MSDN or the N1256 committee draft, I was under the impression that a `char` would always be exactly `CHAR_BIT` bits, as defined in `<limits.h>`. If `CHAR_BIT` is set to 8, then a byte is 8 bits long, and so is a `char`.
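For instance, I would expect a minimal check like the following (separate from the test program below) to print 8 and 1, matching the output further down:

```c
#include <limits.h>
#include <stdio.h>

int main(void) {
    /* sizeof(char) is 1 by definition; CHAR_BIT is the number of bits in a byte */
    printf("CHAR_BIT: %d\n", CHAR_BIT);
    printf("sizeof(char): %u\n", (unsigned)sizeof(char));
    return 0;
}
```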
Test code
Given the following C code:
```c
#include <limits.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int length = 0;
    while (argv[1][length] != '\0') {
        // print the character, its hex value, and its size
        printf("char %u: %c\tvalue: 0x%X\t sizeof char: %u\n",
               length,
               argv[1][length],
               argv[1][length],
               sizeof argv[1][length]);
        length++;
    }
    printf("\nTotal length: %u\n", length);
    printf("Actual char size: %u\n", CHAR_BIT);
    return 0;
}
```
I was unsure what the behaviour would be given arguments that include non-ASCII characters, like `ç` and `à`. Those characters are supposedly UTF-8, so each one is written as multiple bytes. I would expect them to be processed as individual bytes, meaning that `ça` has a length of 3, for example (4 if counting the `\0`), and that when printing I'd get one line per byte, so 3 lines instead of 2 (2 being the actual Latin character count).
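To make that expectation concrete, here is roughly the byte-level view I have in mind, with the UTF-8 encoding of `ça` hard-coded (`0xC3 0xA7` for `ç`, `0x61` for `a`). This is only an illustration of what I expect, not the test program itself:

```c
#include <stdio.h>
#include <string.h>

int main(void) {
    /* "ça" spelled out as UTF-8 bytes: 0xC3 0xA7 for 'ç', then a plain 'a' */
    const char utf8[] = "\xC3\xA7" "a";
    printf("length: %u\n", (unsigned)strlen(utf8));            /* expected: 3 */
    for (int i = 0; utf8[i] != '\0'; i++)
        printf("byte %d: 0x%X\n", i, (unsigned char)utf8[i]);  /* 0xC3, 0xA7, 0x61 */
    return 0;
}
```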
Output
```
$ gcc --std=c99 -o program.exe win32.c
$ program.exe test_çà
char 0: t value: 0x74 sizeof char: 1
char 1: e value: 0x65 sizeof char: 1
char 2: s value: 0x73 sizeof char: 1
char 3: t value: 0x74 sizeof char: 1
char 4: _ value: 0x5F sizeof char: 1
char 5: τ value: 0xFFFFFFE7 sizeof char: 1
char 6: α value: 0xFFFFFFE0 sizeof char: 1
Total length: 7
Actual char size: 8
```
Question
What is probably happening under the hood is that `char **argv` gets turned into `int **argv`. This would explain why lines 5 and 6 show a hexadecimal value written on 4 bytes.
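If that is what happens, I would expect a stripped-down sketch like this one (assuming `char` is signed on this platform, which is just my guess) to show the same 4-byte value for a single byte of `0xE7`:

```c
#include <stdio.h>

int main(void) {
    char c = (char)0xE7;                 /* assuming char is signed here, this is -25 */
    printf("0x%X\n", c);                 /* c is promoted to int before reaching printf */
    printf("0x%X\n", (unsigned char)c);  /* same byte kept in one byte: prints 0xE7 */
    return 0;
}
```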
- Is that what actually happens?
- Is it standard behaviour?
- Why are chars 5 and 6 not what was given as input? `CHAR_BIT == 8` and `sizeof(char) == 1`, and yet one of those chars comes out as `0xFFFFFFE7`. This seems counter-intuitive. What's happening?
Environment
- Windows 10
- Terminal: Alacritty and Windows default cmd (tried in both just in case)
- GCC under Mingw-w64