1

I have a simple program.

#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{   
    for (int i = 0; i < strlen(argv[1]); ++i)
        printf("%x ", argv[1][i]);
    printf("\n");
}

I run it like

$ ./program 111
31 31 31

But when I run it like

$ ./program ●●●
ffffffe2 ffffff97 ffffff8f ffffffe2 ffffff97 ffffff8f ffffffe2 ffffff97 ffffff8f

Here each ● should be encoded as 3 bytes (UTF-8): e2 97 8f, but it looks like each byte is printed as a full unsigned int. I don't understand where the ffffff comes from if sizeof(char) is always 1 byte.

Jonathan Leffler
Kirill Bugaev

3 Answers

4

printf() is a function accepting a variable number of arguments.

Any integer argument of a type shorter than int is automatically converted to type int.

Apparently, in your implementation, the "character" little-round-thing is composed of 3 chars, all with a negative value.

Try these

printf("%x ", (unsigned char)argv[1][i]);
printf("%hhx ", argv[1][i]); // thanks to Jonathan Leffler
pmg
2

I don't understand where the ffffff comes from if sizeof(char) is always 1 byte.

By definition sizeof(char) is 1, but '●' is not a char in the C sense: in UTF-8 it is encoded as 3 chars.

Your char is evidently signed (plain char behaves like signed char on your implementation), so each input ● produces 3 negative values. Each char is promoted to an int (32 bits in your case), and because the %x format treats its argument as unsigned, you get the output shown.

You get the same output from printf("%x", -30); -> ffffffe2


Note that for (int i = 0; i < strlen(argv[1]); ++i) is needlessly expensive: the length doesn't change, so it is better to save it once, or simply to write for (int i = 0; argv[1][i] != 0; ++i)

It would also be better to check that argc is at least 2 before looking into argv[1], since argv[1] is a null pointer when the program is run without arguments.

bruno
2

UTF-8 codeunits for multi-codeunit codepoints (everything but ASCII) are all from 128 to 255, meaning outside the ASCII range.

printf() is a vararg function, and all the arguments passed to the vararg part (all but the format-string) are subject to the default argument promotions.

As your implementation's plain char is 8-bit, signed, and 2s-complement, each UTF-8 codeunit value is negative (between -128 and -1), and after promotion you have an int with that same negative value.

Then you lie to printf() by asserting the argument is an unsigned int (%x is for unsigned int). That is Undefined Behavior, and with 2s-complement it typically shows up as a very big unsigned int being printed.

You could get the right result by using %hhx, though strictly speaking you should cast the argument to unsigned char.

Deduplicator