
I have this sample code:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void){
    printf("%zu\n", sizeof(char));     /* sizeof(char) is always 1; %zu matches size_t */
    char mytext[20];
    ssize_t n = read(0, mytext, 3);    /* read up to 3 raw bytes from standard input */
    if (n < 0)
        return 1;                      /* read failed */
    mytext[n] = '\0';                  /* read() does not null-terminate the buffer */
    printf("%s", mytext);
    return 0;
}

First run:

koray@koray-VirtualBox:~$ ./a.out 
1
pp
pp
koray@koray-VirtualBox:~$ 

Well, I think this is all expected, as 'p' is a 1-byte character defined in ASCII and I am reading 3 bytes (two p's and a line break). In the terminal, I again see the 2 characters.

Now let's try with a character that is 2 bytes long:

koray@koray-VirtualBox:~$ ./a.out 
1
ğ
ğ

What I do not understand is, when I send the character 'ğ' to the memory pointed to by the mytext variable, 16 bits are written to that area. As 'ğ' is 11000100:10011111 in UTF-8, these two bytes are written.

My question is: when printing back to standard out, how does C (or should I say the kernel?) know that it should read 2 bytes and interpret them as 1 character instead of as 2 one-byte characters?
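
To make those bytes visible, here is a small separate sketch (assuming the source file and the terminal both use UTF-8) that prints the individual bytes of a "ğ" string literal:

#include <stdio.h>
#include <string.h>

int main(void){
    const char *s = "ğ";                    /* UTF-8 string literal */
    printf("length: %zu\n", strlen(s));     /* prints 2: two bytes, one character */
    for (size_t i = 0; i < strlen(s); i++)
        printf("byte %zu: 0x%02X\n", i, (unsigned char)s[i]);
    return 0;
}

With a UTF-8 source encoding this prints 0xC4 and 0x9F, which are exactly the 16 bits mentioned above.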

Koray Tugay
  • @DavidSchwartz How does that help me? – Koray Tugay May 21 '15 at 16:38
  • It doesn't, really. UTF-8 platforms tend to implement a `wchar_t` – David Hoelzer May 21 '15 at 16:38
  • Two different sets of functions (for functions like printf) are used in C: one for ASCII, the other for UNICODE. Microsoft has an extension where a program can use the same names, like TCHAR instead of char (ASCII) or WCHAR / wchar_t / unsigned short (UNICODE), _tprintf(), _T("...") for string literals, ..., that are either ASCII or UNICODE depending on project settings. – rcgldr May 21 '15 at 23:12

2 Answers


C doesn't interpret it. Your program reads 2 bytes and outputs the same 2 bytes, without caring what characters (or anything else) they are.

Your terminal encodes your input and reinterprets your output back as the same two-byte character.
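
As a rough sketch of that pass-through (assuming a terminal configured for UTF-8), a program can write the two bytes of 'ğ' hard-coded, never treating them as one character, and the terminal will still render a single glyph:

#include <unistd.h>

int main(void){
    /* The two UTF-8 bytes of 'ğ' (U+011F), followed by a newline. The program
       never treats them as a character; whatever reads the output decides. */
    unsigned char bytes[] = { 0xC4, 0x9F, '\n' };
    write(1, bytes, sizeof bytes);
    return 0;
}

Send the same output to a terminal set to a different encoding (or dump it with a hex viewer) and those identical two bytes will be shown differently.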

Nicole A.
  • So if I had a dumb terminal, it could interpret it as two 8-bit characters? – Koray Tugay May 21 '15 at 16:49
  • But how does the terminal know that that is a 2-byte character? – Koray Tugay May 21 '15 at 16:51
  • It looks at the first bit. Since it's set to 1, the terminal (or whatever is reading the string) knows it's not ASCII, and that it contains 2 bytes or more. It will know if it's more from the content of the other bits. – RSinohara May 21 '15 at 16:59
  • Locale, terminal configuration, guessing. Think about what happens when you output those bytes without further processing, hardcoded in your app. E.g. you may play with Konsole's profile encoding setting to display various characters/junk while repeatedly running that simple app, by interpreting the output differently. – Nicole A. May 21 '15 at 17:06

ASCII ranges from 0 to 127. The first 128 characters of Unicode are the ASCII characters.

The first bit tells whether your character is in the 0-127 range or above it. If it is 1, the character is outside plain ASCII and 16 bits (or even more) will be considered.
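
As a sketch of that check (an illustration only, not a complete decoder), the high bits of the leading byte tell a UTF-8 reader how many bytes the whole sequence occupies:

#include <stdio.h>

/* Length in bytes of the UTF-8 sequence starting with this leading byte,
   or 0 if the byte can only appear in the middle of a sequence. */
static int utf8_seq_len(unsigned char lead){
    if (lead < 0x80)           return 1;   /* 0xxxxxxx: plain ASCII     */
    if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx: 2-byte sequence */
    if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx: 3-byte sequence */
    if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx: 4-byte sequence */
    return 0;                              /* 10xxxxxx: continuation    */
}

int main(void){
    printf("%d\n", utf8_seq_len('p'));     /* 1 */
    printf("%d\n", utf8_seq_len(0xC4));    /* 2: first byte of 'ğ' */
    return 0;
}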

This question is closely related to: What's the difference between ASCII and Unicode?

RSinohara
  • You could also take a look at http://stackoverflow.com/questions/700187/unicode-utf-ascii-ansi-format-differences – RSinohara May 21 '15 at 16:42
  • Which process is looking at the first bit to determine the range? bash? What if I redirect standard out to a file with > somefile.txt? – Koray Tugay May 21 '15 at 16:53
  • Whatever is rendering the string will have to do this check. In your case, that was the terminal. – RSinohara May 21 '15 at 17:01
  • I see, so if I write the bytes to a text file, whatever process opens it needs to check it? – Koray Tugay May 21 '15 at 17:04
  • Absolutely. Also, Unicode characters are not always 2 bytes. Anything decoding them needs to keep checking every character. – RSinohara May 21 '15 at 17:14
  • @KorayTugay All Unicode codepoints are numbered from 0 to 0x10FFFF, so a 21-bit number is enough. Unicode has several encodings that spread those bits into one or more code units in different ways. UTF-32 is simple: four bytes for any codepoint. UTF-8 seems to be the one you are using because its code unit is one byte. So, think about how to put codepoints into one or more code units, with common codepoints using fewer code units than less common codepoints. [Scroll down to Table 3-6](http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf#G7404). – Tom Blodget May 21 '15 at 22:46
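
As a sketch of what the last comment describes (the helper name utf8_encode is just for illustration; the bit layout is UTF-8's, as in the linked Table 3-6), here is how a codepoint's bits are spread over one to four one-byte code units:

#include <stdio.h>

/* Encode a Unicode codepoint (0..0x10FFFF) as UTF-8; returns the byte count.
   Kept minimal: it does not reject surrogates or out-of-range values. */
static int utf8_encode(unsigned int cp, unsigned char out[4]){
    if (cp < 0x80) {                          /* up to 7 bits  -> 1 byte  */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {                  /* up to 11 bits -> 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {                /* up to 16 bits -> 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else {                                  /* up to 21 bits -> 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
}

int main(void){
    unsigned char buf[4];
    int n = utf8_encode(0x011F, buf);         /* U+011F is 'ğ' */
    for (int i = 0; i < n; i++)
        printf("0x%02X ", buf[i]);            /* prints 0xC4 0x9F */
    printf("\n");
    return 0;
}

Running it for U+011F prints the two bytes discussed in the question.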