To investigate how C deals with UTF-8 / Unicode characters, I did this little experiment.
I'm not trying to solve any particular problem at the moment, but I know that Java handles the whole encoding situation in a way that is transparent to the programmer, and I was wondering how C, which is a lot lower level, treats its characters.
The following test seems to indicate that C is entirely ignorant of encoding concerns, and that it is up to the display device to know how to interpret the sequence of chars when showing them on screen. The later tests (printing each character surrounded by _) seem particularly telling.
#include <stdio.h>
#include <string.h>

int main() {
    char str[] = "João"; // ã is not part of 7-bit ASCII

    // strlen counts bytes up to the terminating '\0', not characters
    printf("number of chars = %d\n", (int)strlen(str)); // 5

    int len = 0;
    while (str[len] != '\0')
        len++;
    printf("number of bytes = %d\n", len); // 5

    for (int i = 0; i < len; i++)
        printf("%c", str[i]);
    puts("");
    // "João"

    for (int i = 0; i < len; i++)
        printf("_%c_", str[i]);
    puts("");
    // _J__o__�__�__o_ -> wow!!!

    str[2] = 'X'; // let's change this special character
                  // and see what happens

    for (int i = 0; i < len; i++)
        printf("%c", str[i]);
    puts("");
    // JoX�o

    for (int i = 0; i < len; i++)
        printf("_%c_", str[i]);
    puts("");
    // _J__o__X__�__o_
}
I know how ASCII and UTF-8 work; what I'm really unsure about is at what point the bytes get interpreted as "compound" characters, since C seems to treat them as just dumb bytes. What is really going on behind this?