
I have this sample code:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void){
    printf("%zu\n", sizeof(char));     /* sizeof(char) is always 1; %zu matches size_t */
    char mytext[20];
    ssize_t n = read(0, mytext, 3);    /* read up to 3 raw bytes from standard input */
    if (n < 0)
        return 1;                      /* read failed */
    mytext[n] = '\0';                  /* read() does not null-terminate the buffer */
    printf("%s", mytext);
    return 0;
}

First run:

koray@koray-VirtualBox:~$ ./a.out 
1
pp
pp
koray@koray-VirtualBox:~$ 

Well, I think this is all expected, as 'p' is a 1-byte character defined in ASCII and I am reading 3 bytes (two p's and a line break). In the terminal, I again see the 2 characters.

Now let's try with a character that is 2 bytes long:

koray@koray-VirtualBox:~$ ./a.out 
1
ğ
ğ

What I do not understand is, when I send the character 'ğ' to the memory pointed to by the mytext variable, 16 bits are written to that area. As 'ğ' is 11000100:10011111 in UTF-8, these two bytes are written.

My question is: when printing back to standard out, how does C (or should I say the kernel?) know that it should read 2 bytes and interpret them as 1 character instead of as 2 one-byte characters?
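
To make those bytes visible, here is a small separate sketch (assuming the source file and the terminal both use UTF-8) that prints the individual bytes of a "ğ" string literal:

#include <stdio.h>
#include <string.h>

int main(void){
    const char *s = "ğ";                    /* UTF-8 string literal */
    printf("length: %zu\n", strlen(s));     /* prints 2: two bytes, one character */
    for (size_t i = 0; i < strlen(s); i++)
        printf("byte %zu: 0x%02X\n", i, (unsigned char)s[i]);
    return 0;
}

With a UTF-8 source encoding this prints 0xC4 and 0x9F, which are exactly the 16 bits mentioned above.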

Koray Tugay
  • @DavidSchwartz How does that help me? – Koray Tugay May 21 '15 at 16:38
  • It doesn't, really. UTF-8 platforms tend to implement a `wchar_t` – David Hoelzer May 21 '15 at 16:38
  • Two different sets of functions (for functions like printf) are used in C: one for ASCII, the other for UNICODE. Microsoft has an extension where a program can use the same names, like TCHAR instead of char (ASCII) or WCHAR / wchar_t / unsigned short (UNICODE), _tprintf(), _T("...") for string literals, ..., that are either ASCII or UNICODE depending on project settings. – rcgldr May 21 '15 at 23:12

2 Answers


C doesn't interpret it. Your program reads 2 bytes and outputs the same 2 bytes, without caring what characters (or anything else) they are.

Your terminal encodes your input and reinterprets your output back as the same two-byte character.
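
As a rough sketch of that pass-through (assuming a terminal configured for UTF-8), a program can write the two bytes of 'ğ' hard-coded, never treating them as one character, and the terminal will still render a single glyph:

#include <unistd.h>

int main(void){
    /* The two UTF-8 bytes of 'ğ' (U+011F), followed by a newline. The program
       never treats them as a character; whatever reads the output decides. */
    unsigned char bytes[] = { 0xC4, 0x9F, '\n' };
    write(1, bytes, sizeof bytes);
    return 0;
}

Send the same output to a terminal set to a different encoding (or dump it with a hex viewer) and those identical two bytes will be shown differently.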

Nicole A.
  • So if I had a dumb terminal, it could interpret it as two 8-bit characters? – Koray Tugay May 21 '15 at 16:49
  • But how does the terminal know that that is a 2-byte character? – Koray Tugay May 21 '15 at 16:51
  • It looks at the first bit. Since it's set to 1, the terminal (or whatever is reading the string) knows it's not ASCII, and that it contains 2 bytes or more. It will know if it's more from the content of the other bits. – RSinohara May 21 '15 at 16:59
  • Locale, terminal configuration, guessing. Think about what happens when you output those bytes without further processing, hardcoded in your app. E.g. you may play with Konsole's profile encoding setting to display various characters/junk while repeatedly running that simple app, by interpreting the output differently. – Nicole A. May 21 '15 at 17:06

ASCII ranges from 0 to 127. The first 128 characters of Unicode are the ASCII characters.

The first bit tells whether your character is in the 0-127 range or above it. If it is 1, the character is outside plain ASCII and 16 bits (or even more) will be considered.
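
As a sketch of that check (an illustration only, not a complete decoder), the high bits of the leading byte tell a UTF-8 reader how many bytes the whole sequence occupies:

#include <stdio.h>

/* Length in bytes of the UTF-8 sequence starting with this leading byte,
   or 0 if the byte can only appear in the middle of a sequence. */
static int utf8_seq_len(unsigned char lead){
    if (lead < 0x80)           return 1;   /* 0xxxxxxx: plain ASCII     */
    if ((lead & 0xE0) == 0xC0) return 2;   /* 110xxxxx: 2-byte sequence */
    if ((lead & 0xF0) == 0xE0) return 3;   /* 1110xxxx: 3-byte sequence */
    if ((lead & 0xF8) == 0xF0) return 4;   /* 11110xxx: 4-byte sequence */
    return 0;                              /* 10xxxxxx: continuation    */
}

int main(void){
    printf("%d\n", utf8_seq_len('p'));     /* 1 */
    printf("%d\n", utf8_seq_len(0xC4));    /* 2: first byte of 'ğ' */
    return 0;
}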

This question is closely related to: What's the difference between ASCII and Unicode?

RSinohara
  • You could also take a look at http://stackoverflow.com/questions/700187/unicode-utf-ascii-ansi-format-differences – RSinohara May 21 '15 at 16:42
  • Which process is looking at the first bit to determine the range? bash? What if I redirect standard out to a file with > somefile.txt? – Koray Tugay May 21 '15 at 16:53
  • Whatever is rendering the string will have to do this check. In your case, that was the terminal. – RSinohara May 21 '15 at 17:01
  • I see, so if I write the bytes to a text file, whatever process opens it needs to check it? – Koray Tugay May 21 '15 at 17:04
  • Absolutely. Also, Unicode characters are not always 2 bytes. Anything decoding them needs to keep checking every character. – RSinohara May 21 '15 at 17:14
  • @KorayTugay All Unicode codepoints are numbered from 0 to 0x10FFFF, so a 21-bit number is enough. Unicode has several encodings that spread those bits into one or more code units in different ways. UTF-32 is simple: four bytes for any codepoint. UTF-8 seems to be the one you are using because its code unit is one byte. So, think about how to put codepoints into one or more code units, with common codepoints using fewer code units than less common codepoints. [Scroll down to Table 3-6](http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf#G7404). – Tom Blodget May 21 '15 at 22:46
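
As a sketch of what the last comment describes (the helper name utf8_encode is just for illustration; the bit layout is UTF-8's, as in the linked Table 3-6), here is how a codepoint's bits are spread over one to four one-byte code units:

#include <stdio.h>

/* Encode a Unicode codepoint (0..0x10FFFF) as UTF-8; returns the byte count.
   Kept minimal: it does not reject surrogates or out-of-range values. */
static int utf8_encode(unsigned int cp, unsigned char out[4]){
    if (cp < 0x80) {                          /* up to 7 bits  -> 1 byte  */
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {                  /* up to 11 bits -> 2 bytes */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {                /* up to 16 bits -> 3 bytes */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else {                                  /* up to 21 bits -> 4 bytes */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
}

int main(void){
    unsigned char buf[4];
    int n = utf8_encode(0x011F, buf);         /* U+011F is 'ğ' */
    for (int i = 0; i < n; i++)
        printf("0x%02X ", buf[i]);            /* prints 0xC4 0x9F */
    printf("\n");
    return 0;
}

Running it for U+011F prints the two bytes discussed in the question.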