I find it a bit difficult to fully grasp the use of u8 strings. I know they are UTF-8-encoded strings, but the results of my tests seem to point in another direction. I'm using gcc 7.5 on Linux. This is my test code:
#include <stdio.h>
#include <string.h>

int main()
{
    char a[] = u8"gå";
    int l = strlen(a);
    /* print each byte: as a character, as its numeric value, and its size */
    for (int i = 0; i < l; i++)
        printf("%c - %d - %zu\n", a[i], (unsigned char)a[i], sizeof(a[i]));
    printf("%d: %s\n", l, a);
    return 0;
}
After running, I get this:
g - 103 - 1
� - 195 - 1
� - 165 - 1
3: gå
Which makes sense: it's using 2 bytes to encode the å and 1 byte to encode the g, 3 bytes in total.
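To double-check the byte values, here is a minimal sketch (assuming the same gcc 7.5 / UTF-8 setup as above) that dumps the literal in hex; å is U+00E5, whose UTF-8 encoding is the two bytes 0xC3 0xA5, i.e. 195 and 165, matching the output:

#include <stdio.h>

int main()
{
    /* u8 is supposed to guarantee UTF-8: 'å' (U+00E5) should be stored as 0xC3 0xA5 */
    char a[] = u8"gå";
    for (size_t i = 0; a[i] != '\0'; i++)
        printf("byte %zu: 0x%02X\n", i, (unsigned char)a[i]);
    return 0;
}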
Then I remove the u8 prefix, and I get the same result. From this I would conclude that gcc is actually using UTF-8 to encode ordinary string literals by default, which the standard allows. So far, so good.
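One way to check that assumption directly (a sketch, assuming gcc's default execution character set here is UTF-8; I believe it can also be changed with -fexec-charset) is to compare the bytes of a plain literal and a u8 literal:

#include <stdio.h>
#include <string.h>

int main()
{
    char plain[] = "gå";   /* encoded in the execution character set */
    char utf8[]  = u8"gå"; /* required to be UTF-8 */

    if (sizeof plain == sizeof utf8 && memcmp(plain, utf8, sizeof plain) == 0)
        printf("identical bytes\n");
    else
        printf("different bytes\n");
    return 0;
}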
But now I try something else: I restore the u8 prefix and change the encoding of the source file to ISO-8859. And I get this:
g - 103 - 1
� - 229 - 1
2: g�
Not only has the encoding changed (it shouldn't have, since it's a u8 string), but the string also prints incorrectly. If I remove the prefix again, I get this same result once more. It's acting as if the u8 prefix were ignored and the encoding were decided by the text encoding of the source file.
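For what it's worth, 229 (0xE5) is the code for å in ISO-8859-1, and a lone 0xE5 byte is not valid UTF-8, which would explain the replacement glyph on a UTF-8 terminal. A tiny sketch of just that byte (assuming a UTF-8 terminal, as in my tests above):

#include <stdio.h>

int main()
{
    /* 0xE5 (229) is 'å' in ISO-8859-1; as a lone byte it is not valid UTF-8,
       so a UTF-8 terminal shows the replacement glyph instead */
    char iso_byte = (char)0xE5;
    printf("%c - %d\n", iso_byte, (unsigned char)iso_byte);
    return 0;
}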
So my two questions here are:

- Why isn't the u8 prefix doing anything?
- Why isn't the string printing correctly when I encode my source code as ISO-8859?