Printing bytes of UTF-8 string in C

Question

I wanted to print the individual bytes of the word "česnek", expecting printf to output 7 bytes because "č" is encoded in 2 bytes. It does print 7 bytes, but the terminal shows garbage characters such as question marks. If I print the integer values instead, I get this sequence:

-60 -115 101 115 110 101 107

Why are the first two numbers negative? Here is the code I used to try it.

 char *utfstring = "česnek";
 for(size_t i = 0; i < strlen(utfstring); i++) {
     printf("%c ", utfstring[i]);
 }
 for(size_t i = 0; i < strlen(utfstring); i++) {
     printf("%d ", utfstring[i]);
 }

I expected the first two values to be c4 8d, because č is encoded that way according to https://www.utf8-chartable.de/unicode-utf8-table.pl?start=256&unicodeinhtml=dec

On some systems `char` is signed, on others it is unsigned — Basile Starynkevitch, Nov 05 '18 at 19:22
Possible duplicate of [Printing UTF-8 strings with printf - wide vs. multibyte string literals](https://stackoverflow.com/questions/15528359/printing-utf-8-strings-with-printf-wide-vs-multibyte-string-literals) — Vineet Jain, Nov 05 '18 at 19:23
The correct font for this character must be installed on the system and accessible to the given terminal program. On my system, it prints just fine with `printf("%s\n",utfstring);` Looping on `"%c "` will _not_ work (injecting the space _breaks_ the `utf-8` sequence). The compiler does _not_ care about this. Nor does `printf` with `%s` as it just sees a string of bytes. Only the terminal program/font will care. Better to print the individual chars with `%2.2X` to see individually [when in doubt, mask the value against `0xFF` to avoid sign extension of `char`]. — Craig Estey, Nov 05 '18 at 19:32

Barmak Shemirani · Answer 1 · 2018-11-05T21:47:33.410

5

Use (unsigned char)utfstring[i] or 0xFF & utfstring[i] to get hexadecimal output as follows:

char *utfstring = u8"česnek";
for(size_t i = 0; i < strlen(utfstring); i++)
    printf("%02X ", 0xFF & utfstring[i]);

output:

"C4 8D 65 73 6E 65 6B"

The first alphabetic character č cannot be represented by a single byte in UTF8. If you print utfstring one byte at a time, then the UTF8 encoding is broken.

It has to be printed as u8"č" or u8"\xC4\x8D"

In general you will need a Unicode library, such as iconv, if you wish to break the byte sequence into separate Unicode code points. If you are simply trying to find č, then use the standard string functions, for example strstr(utfstring, u8"č").
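As a rough sketch of the two points above (not a replacement for a real Unicode library): in valid UTF-8, code points can be counted by skipping continuation bytes of the form 10xxxxxx, and strstr can locate a multi-byte character because it only compares bytes. The helper names utf8_count_code_points and utf8_find are made up for this illustration, and the code assumes the input is already valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: count code points in a UTF-8 string by counting
   every byte that is NOT a continuation byte (10xxxxxx).
   Assumption: the input is valid UTF-8; nothing is validated here. */
size_t utf8_count_code_points(const char *s)
{
    assert(s != NULL);
    size_t count = 0;
    for (; *s != '\0'; s++) {
        /* Cast first so the mask works the same whether char is signed or not. */
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}

/* Hypothetical helper: byte offset of the first occurrence of needle in
   haystack, or -1 if it does not occur. strstr compares raw bytes, so a
   multi-byte character is found like any other substring. */
long utf8_find(const char *haystack, const char *needle)
{
    assert(haystack != NULL && needle != NULL);
    const char *hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1L;
}
```

With the word from the question written as "\xC4\x8D" "esnek" (the literal is split so that the following 'e' is not read as part of the hex escape), utf8_count_code_points returns 6 while strlen returns 7, and utf8_find reports č at byte offset 0.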

edited Nov 05 '18 at 21:47

answered Nov 05 '18 at 20:54

Barmak Shemirani


"If you print utfstring one byte at a time, then the UTF8 encoding is broken.": Well, bytes go into the output stream one at a time. Breakage would only come from the formatting functions or the output stream transforming them (which, for %c, I don't think would cause UTF-8 breakage). – Tom Blodget Nov 06 '18 at 11:59
@TomBlodget yes `printf("%c%c\n", 0xC4, 0x8D)` can work, but `printf("%c %c\n", 0xC4, 0x8D)` (with space in between) will print some ASCII characters, the latter is attempted in the question. – Barmak Shemirani Nov 06 '18 at 16:11

TypeIA · Answer 2 · 2018-11-06T05:01:30.087

1

First, the signedness of char is implementation-defined. On top of that, you're telling printf() to print a signed number by using %d. To portably print the bytes as unsigned numbers, cast them to unsigned char and print them using the %u format specifier:

printf("%u ", (unsigned char) utfstring[i]);

That'll take care of the negative numbers, but you have another problem: the C standard does not require a compiler to accept UTF-8 encoded characters in source code. Only a small set of basic characters are guaranteed by the standard. You may need to check the documentation for your specific compiler and standard library to see how this is handled. You may get UTF-8, some other encoding, or garbage; and whatever you get, it isn't portable. If this sounds lame, you're right, it is - C/C++ have been playing catch-up for a long time when it comes to i18n.

The good news is, things are getting better. If your compiler supports C11, you can and should take advantage of UTF-8 string literals to portably encode UTF-8 code points in strings.
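To illustrate the cast in isolation, here is a minimal sketch that formats the bytes of a string as hex pairs into a caller-supplied buffer. The function name bytes_to_hex and its return convention are invented for this example; only the (unsigned char) cast and the standard snprintf call come from the discussion above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: write the bytes of s into out as space-separated
   uppercase hex pairs. Returns the number of characters written (not
   counting the terminating NUL), or -1 if out is too small. */
int bytes_to_hex(const char *s, char *out, size_t out_size)
{
    assert(s != NULL && out != NULL);
    if (out_size == 0)
        return -1;
    out[0] = '\0';

    size_t len = strlen(s);
    size_t used = 0;
    for (size_t i = 0; i < len; i++) {
        const char *sep = (i == 0) ? "" : " ";
        /* The cast keeps each value in 0..255 even where plain char is
           signed, so 0xC4 is not sign-extended to a negative int. */
        int n = snprintf(out + used, out_size - used, "%s%02X",
                         sep, (unsigned char)s[i]);
        if (n < 0 || (size_t)n >= out_size - used)
            return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Called on the seven bytes of "česnek" with a 64-byte buffer, this fills the buffer with C4 8D 65 73 6E 65 6B. Without the cast, a platform with signed char would pass negative values for the first two bytes, which is where the -60 and -115 in the question come from.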

edited Nov 06 '18 at 05:01

answered Nov 05 '18 at 19:27

TypeIA


oh, i see, i thought i'd get these values "c4 8d" first, because that is the value of č according to https://www.utf8-chartable.de/unicode-utf8-table.pl?start=256&unicodeinhtml=dec – Nov 05 '18 at 20:16
`(unsigned) utfstring[i]` will not portably print them as desired. `(unsigned char) utfstring[i]` makes more sense unless you want -1 as FFFFFFFF – chux - Reinstate Monica Nov 05 '18 at 21:21
@chux I see your point, but doesn't `%u` expect an argument of type `unsigned`, not `unsigned char` - so in this situation do we need a double cast `(unsigned) (unsigned char) utfstring[i]`? – TypeIA Nov 05 '18 at 21:35
`(unsigned char) utfstring[i]` is sufficient. With arguments to ... (ellipsis) functions, positive `int` values must encode the same as `unsigned` per C11 §6.5.2.2 6, so using `"%u"` with `(unsigned char) utfstring[i]` is well-specified. – chux - Reinstate Monica Nov 05 '18 at 21:52
@chux Edited, thanks. Do you happen to know if that is true pre-C11 also? – TypeIA Nov 06 '18 at 05:02
@TypeIA C99 is like C11 in this regard. As I read C89, the same applies, although less clearly. – chux - Reinstate Monica Nov 06 '18 at 06:24

score 0 · Answer 3 · answered Nov 05 '18 at 19:43

Your for-loop iterates through the character value byte-by-byte, when the UTF representation is multi-byte.

char *utfstring = "česnek"; is more than six bytes long! Because the first "character" in that string occupies more than one byte. (The cleverness of the UTF-8 representation is that each byte is self-describing: by examining the binary content of a byte alone, you can reliably determine what "kind" of byte it is, and where it falls [if applicable] in a multi-byte sequence.)

Your logic tries to use %c and then %d formats against these bytes when, arguably, neither one is most appropriate. "In this [human] context, these aren't really characters, nor are they integers." Try %x ... hexadecimal. "Show me the bits."
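The "kind of byte" idea can be sketched as a check on the high bits of each byte. This is only an illustration: the names utf8_byte_kind, utf8_classify and utf8_sequence_length are invented here, and the code looks at bit patterns only, so it does not reject overlong or out-of-range sequences the way a full validator would.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of a single byte by its leading bits. */
typedef enum {
    UTF8_ASCII,        /* 0xxxxxxx: a complete one-byte character   */
    UTF8_CONTINUATION, /* 10xxxxxx: second or later byte of a sequence */
    UTF8_LEAD_2,       /* 110xxxxx: starts a 2-byte sequence        */
    UTF8_LEAD_3,       /* 1110xxxx: starts a 3-byte sequence        */
    UTF8_LEAD_4,       /* 11110xxx: starts a 4-byte sequence        */
    UTF8_INVALID       /* 11111xxx: never appears in UTF-8          */
} utf8_byte_kind;

utf8_byte_kind utf8_classify(unsigned char b)
{
    if ((b & 0x80) == 0x00) return UTF8_ASCII;
    if ((b & 0xC0) == 0x80) return UTF8_CONTINUATION;
    if ((b & 0xE0) == 0xC0) return UTF8_LEAD_2;
    if ((b & 0xF0) == 0xE0) return UTF8_LEAD_3;
    if ((b & 0xF8) == 0xF0) return UTF8_LEAD_4;
    return UTF8_INVALID;
}

/* Number of bytes the sequence starting at s[0] claims to occupy,
   or 0 if s[0] cannot start a sequence (continuation or invalid byte). */
int utf8_sequence_length(const char *s)
{
    assert(s != NULL);
    switch (utf8_classify((unsigned char)s[0])) {
    case UTF8_ASCII:  return 1;
    case UTF8_LEAD_2: return 2;
    case UTF8_LEAD_3: return 3;
    case UTF8_LEAD_4: return 4;
    default:          return 0;
    }
}
```

For the string in the question, 0xC4 classifies as the lead byte of a 2-byte sequence, 0x8D as a continuation byte, and the remaining five bytes as ASCII.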

oh, i see, i thought i'd get these values "c4 8d" first, because that is the value of č according to https://www.utf8-chartable.de/unicode-utf8-table.pl?start=256&unicodeinhtml=dec — , Nov 05 '18 at 20:16
The OP stated they expected 7 bytes and they appear to understand UTF-8. The question was _why_ the printed numbers were negative decimals instead of hex. — TypeIA, Nov 05 '18 at 20:54
