
To investigate how C deals with UTF-8 / Unicode characters, I did this little experiment.

It's not that I'm trying to solve anything in particular at the moment, but I know that Java handles the whole encoding situation in a way that is transparent to the coder, and I was wondering how C, which is a lot lower level, treats its characters.

The following test seems to indicate that C is entirely ignorant of encoding concerns, and that it's just up to the display device to know how to interpret the sequence of chars when showing them on screen. The later tests (printing each character surrounded by _) seem particularly telling.

#include <stdio.h>
#include <string.h>

int main() {
    char str[] = "João"; // ã does not belong to the standard 
                         // (or extended) ASCII characters

    printf("number of chars = %d\n", (int)strlen(str)); // 5

    int len = 0;
    while (str[len] != '\0')
        len++;
    printf("number of bytes = %d\n", len); // 5

    for (int i = 0; i < len; i++)
        printf("%c", str[i]);
    puts("");
    // "João"

    for (int i = 0; i < len; i++)
        printf("_%c_", str[i]);
    puts("");
    // _J__o__�__�__o_ -> wow!!!

    str[2] = 'X'; // let's change this special character
                  // and see what happens
    for (int i = 0; i < len; i++)
        printf("%c", str[i]);
    puts("");
    // JoX�o

    for (int i = 0; i < len; i++)
        printf("_%c_", str[i]);
    puts("");
    // _J__o__X__�__o_
} 

I know how ASCII and UTF-8 work. What I'm really unsure about is at what moment the bytes get interpreted as "compound" characters, since it seems that C just treats them as dumb bytes. What's really the science behind this?

devoured elysium
  • Do you save the C file as UTF-8 or MBCS or UTF-16? Linux or Windows (which version)? A short answer is that Windows 10 tends to fully support UTF-8, and UTF-8 is becoming the common encoding everywhere. Then a UTF-8 string in C indeed is a dumb array of bytes, unless you try to split it into characters. – ddbug Jul 16 '19 at 23:44
  • Hi! I'm on Ubuntu 16.04 LTS. CLion indicates it's a UTF-8 file (as expected in Linux). – devoured elysium Jul 16 '19 at 23:45
  • *The C Programming Language* was published in 1978. Unicode started in 1991. Should it surprise you that C doesn't handle character encodings well? – Lee Daniel Crocker Jul 16 '19 at 23:55
  • C is an unfortunate language where a character must be an integral type. In most other languages a character is a string that is one character long, no matter how it is represented internally or how many bytes it needs, so you can extract one character, print it, and it will work. In C, `char` is one byte, which can be any part of a UTF-8 character. Or you resort to longer integers (16, 32 bits...). If you are interested in C++ too, read about `char8_t`. – ddbug Jul 16 '19 at 23:55
  • `for (int i = 0; i < sizeof str; i++) printf("%d", str[i]);` to reveal details and see true encoding. – chux - Reinstate Monica Jul 17 '19 at 00:04
  • @LeeDanielCrocker: I never said I was. The question is: if C doesn't care, at which point are the encoded characters assembled? – devoured elysium Jul 17 '19 at 00:04
  • @chux I did that too, yes. – devoured elysium Jul 17 '19 at 00:04
  • devoured elysium, perhaps you did on your machine but that result is not here. Notice `sizeof` and `"%d"`. – chux - Reinstate Monica Jul 17 '19 at 00:05
  • @devouredelysium If the characters are sent to a graphic device (terminal, printer), the device assembles them into visible form. If the characters are sent to a file, they remain dumb arrays of bytes. The C language plays a very small role in this. C++ streams know more about encoding, but this knowledge is devalued by the universal use of UTF-8. – ddbug Jul 17 '19 at 00:34
  • At least three things to be aware of that are relevant to character encodings in C: source-charset, exec-charset and locales. (The first two are very simple. The third, I probably wouldn't explain sufficiently or accurately in general.) – Tom Blodget Jul 17 '19 at 16:18

2 Answers

The printing isn't a function of C, but of the display context, whatever that is. For a terminal there are UTF-8 decoding functions which map the raw character data into the character to be shown on screen using a particular font. A similar sort of display logic happens in graphical applications, though with even more complexity relating to proportional font widths, ligatures, hyphenation, and numerous other typographical concerns.

Internally this is often done by decoding UTF-8 into some intermediate form first, like UTF-16 or UTF-32, for look-up purposes. In extremely simple terms, each character in a font has a Unicode identifier. In practice this is a lot more complicated as there is room for character variants, and multiple characters may be represented by a singular character in a font, like "fi" and "ff" ligatures. Accented characters like "ç" may be a combination of characters, as allowed by Unicode. That's where things like Zalgo text come about: you can often stack a truly ridiculous number of Unicode "combining characters" together into a single output character.

Typography is a complex world with complex libraries required to render properly.

You can handle UTF-8 data in C, but for most purposes only with extra libraries. The byte-oriented functions in the Standard Library do not decode it (the locale-dependent multibyte functions such as `mbrtowc` are a limited exception): to C it's just a series of bytes, and it assumes a byte is equivalent to a character for the purposes of length. That is, `strlen` and the like work with bytes as the unit, not characters.

C++, as an example, has much better support for this distinction between byte and character. Other languages have even better support, with languages like Swift having exceptional support for UTF-8 specifically and Unicode in general.

tadman

`printf("_%c_", str[i]);` prints the character associated with each `str[i]`, one at a time.

The `char` value `str[i]` is converted to an `int` when passed to a variadic (`...`) function. The `int` value is then converted to `unsigned char` as directed by `"%c"`, "and the resulting character is written".

`char str[] = "João";` is not guaranteed to produce a UTF-8 sequence. That is an implementation detail. A specified way is to use `char str[] = u8"João";`, available since C11.

`printf()` does not specify a direct way to print UTF-8 strings.

chux - Reinstate Monica