I find it a bit difficult to fully grasp the use of u8 strings. I know they are UTF-8-encoded strings, but the results of my tests seem to point in another direction. I'm using gcc 7.5 on Linux. This is my test code:
#include <stdio.h>
#include <string.h>

int main()
{
    char a[] = u8"gå";
    int l = strlen(a);
    /* print each byte: as a character, as its numeric value, and its size */
    for (int i = 0; i < l; i++)
        printf("%c - %d - %zu\n", a[i], (unsigned char)a[i], sizeof(a[i]));
    printf("%d: %s\n", l, a);
    return 0;
}
After running, I get this:
g - 103 - 1
� - 195 - 1
� - 165 - 1
3: gå
Which makes sense: it's using 2 bytes to encode the å and 1 byte to encode the g, 3 bytes in total.
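To double-check the byte values, here is a minimal sketch (assuming the same gcc 7.5 / UTF-8 setup as above) that dumps the literal in hex; å is U+00E5, whose UTF-8 encoding is the two bytes 0xC3 0xA5, i.e. 195 and 165, matching the output:

#include <stdio.h>

int main()
{
    /* u8 is supposed to guarantee UTF-8: 'å' (U+00E5) should be stored as 0xC3 0xA5 */
    char a[] = u8"gå";
    for (size_t i = 0; a[i] != '\0'; i++)
        printf("byte %zu: 0x%02X\n", i, (unsigned char)a[i]);
    return 0;
}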
Then I remove the u8 prefix, and I get the same result. From this I would conclude that gcc is actually using UTF-8 to encode ordinary string literals by default, which the standard allows. So far, so good.
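One way to check that assumption directly (a sketch, assuming gcc's default execution character set here is UTF-8; I believe it can also be changed with -fexec-charset) is to compare the bytes of a plain literal and a u8 literal:

#include <stdio.h>
#include <string.h>

int main()
{
    char plain[] = "gå";   /* encoded in the execution character set */
    char utf8[]  = u8"gå"; /* required to be UTF-8 */

    if (sizeof plain == sizeof utf8 && memcmp(plain, utf8, sizeof plain) == 0)
        printf("identical bytes\n");
    else
        printf("different bytes\n");
    return 0;
}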
But now I try something else: I restore the u8 prefix and change the encoding of the source file to ISO-8859. And I get this:
g - 103 - 1
� - 229 - 1
2: g�
Not only has the encoding changed (it shouldn't have, since it's a u8 string), but the string also prints incorrectly. If I remove the prefix again, I get this same result once more. It's acting as if the u8 prefix were ignored and the encoding were decided by the text encoding of the source file.
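For what it's worth, 229 (0xE5) is the code for å in ISO-8859-1, and a lone 0xE5 byte is not valid UTF-8, which would explain the replacement glyph on a UTF-8 terminal. A tiny sketch of just that byte (assuming a UTF-8 terminal, as in my tests above):

#include <stdio.h>

int main()
{
    /* 0xE5 (229) is 'å' in ISO-8859-1; as a lone byte it is not valid UTF-8,
       so a UTF-8 terminal shows the replacement glyph instead */
    char iso_byte = (char)0xE5;
    printf("%c - %d\n", iso_byte, (unsigned char)iso_byte);
    return 0;
}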
So my two questions here are:

- Why isn't the u8 prefix doing anything?
- Why isn't the string printing correctly when I encode my source code as ISO-8859?