
I have to write a program that takes UTF-8 encoded characters and "translates" them into Unicode code points. UTF-8 is described at https://en.wikipedia.org/wiki/UTF-8. I am a C beginner, so I have three restrictions placed on me:

  1. I must use getchar()
  2. It is forbidden to use arrays
  3. I am only interested in Unicode characters encoded in 1, 2, 3, or 4 bytes

So I have this code, which works for 4-byte sequences (I know I must compare every getchar() result against EOF, but that is not my problem for now):

#include <stdio.h>

int main(void) {
        int ch1, ch2, ch3, ch4, c;
        ch1 = getchar();
        ch2 = getchar();
        ch3 = getchar();
        ch4 = getchar();
        if ((ch1 & 0xF8) != 0xF0 || (ch2 & 0xC0) != 0x80 ||
                        (ch3 & 0xC0) != 0x80 || (ch4 & 0xC0) != 0x80) {
                printf("Error in UTF-8 4-byte encoding\n");
                return 1;
        }
        c = ((ch1 & 0x07) << 18) | ((ch2 & 0x3F) << 12) |
                        ((ch3 & 0x3F) << 6) | (ch4 & 0x3F);
        printf("c = %05X\n", c);
        return 0;
}

My question: I cannot understand how to use getchar() for 1-, 2-, and 3-byte characters. Must I call getchar() four times at the beginning and then use ch1 for 1-byte characters, ch1 and ch2 for 2-byte characters, and so on, or must I do something like the code below? (By the way, the code below does not work; it gives me an infinite loop. I include it only as an example of my thinking.)

#include <stdio.h>

int main (void) {
        int ch1, ch2, ch3, ch4, c;

        if (c >=0x0000 && c<=0x007F ){
             ch1=getchar();
            while (ch1 !=EOF){
                if ((ch1 & 0x80) != 0x00) {
                    printf("Error in UTF-8 1-byte encoding\n");
                    return 1;   
                   }
                 c = ((ch1 & 0x80) << 7);
                 printf("c = %05X\n", c);
                }
        }
Jonathan Leffler
navarian
  • Note that UTF-8 never needs more than 4 bytes because Unicode limits itself to the range U+0000..U+10FFFF. Indeed, some bytes (0xC0, 0xC1, and 0xF5..0xFF) cannot appear in valid UTF-8. See also [Really Good, Bad UTF-8 Example Test Data](https://stackoverflow.com/questions/1319022/really-good-bad-utf-8-example-test-data) – Jonathan Leffler Dec 12 '15 at 14:33

1 Answer


You can't do it by first reading four characters and then deciding what to do. If the first byte is in 0x00-0x7F, you would either throw the other three away or have to handle them in a more complicated way.

The proper way is to read one byte first. It tells you how many extra bytes you need, if any, based on how many of its most significant bits are 1s. Then read the extra bytes and build the Unicode code point by masking off the marker bits and shifting the remaining bits into place.

You can check the documentation you linked to see how the bits of the Unicode code point are distributed over several bytes. Here is also a brief explanation of the algorithm:

  • Read one byte
  • If the topmost bit is zero, there is nothing else to do: the code point is 0x00-0x7f
  • If the topmost three bits are 110, then you need one extra byte. Take the five lowest bits of the first byte, shift them left by six bits, and OR in the lowest six bits of the second byte to get the final value
  • If the topmost four bits are 1110, then you need two extra bytes. Take the four lowest bits of the first byte, shift them left by 12 bits, OR in the six lowest bits of the second byte shifted left by six, then finally OR in the six lowest bits of the third byte
  • If the topmost five bits are 11110, then you need three extra bytes. Take the three lowest bits of the first byte, shift them left by 18 bits, and OR in the six lowest bits of each following byte, shifted by 12, 6, and 0 bits
  • If none of those conditions fit, the data is invalid
  • Note that when reading extra bytes, those bytes must have 10 as the most significant bits; anything else is invalid.
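The steps above could be sketched roughly as follows. This is only one possible arrangement, not the required solution: the function names and the two sentinel values UTF8_END and UTF8_ERROR are made up for this illustration. It reads with getchar(), uses no arrays, and handles 1 to 4 bytes, but it performs only the basic checks from the list.

```c
#include <stdio.h>

/* Hypothetical sentinel values, invented for this sketch. */
#define UTF8_END   (-1L)   /* clean end of input */
#define UTF8_ERROR (-2L)   /* malformed or truncated sequence */

/* How many extra bytes the first byte announces, or -1 if it cannot
   start a sequence. */
int utf8_extra_bytes(int lead)
{
    if ((lead & 0x80) == 0x00) return 0;   /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 1;   /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 2;   /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 3;   /* 11110xxx */
    return -1;
}

/* The data bits carried by the first byte, marker bits masked off. */
long utf8_lead_payload(int lead, int extra)
{
    switch (extra) {
    case 0:  return lead & 0x7F;
    case 1:  return lead & 0x1F;
    case 2:  return lead & 0x0F;
    default: return lead & 0x07;
    }
}

/* Extra bytes must look like 10xxxxxx. */
int utf8_is_continuation(int byte)
{
    return (byte & 0xC0) == 0x80;
}

/* Make room for six more bits and OR them in. */
long utf8_append(long acc, int cont)
{
    return (acc << 6) | (cont & 0x3F);
}

/* Read one code point from standard input using getchar() only. */
long read_code_point(void)
{
    int lead = getchar();
    int extra, i;
    long cp;

    if (lead == EOF)
        return UTF8_END;

    extra = utf8_extra_bytes(lead);
    if (extra < 0)
        return UTF8_ERROR;

    cp = utf8_lead_payload(lead, extra);
    for (i = 0; i < extra; i++) {
        int next = getchar();
        if (next == EOF || !utf8_is_continuation(next))
            return UTF8_ERROR;
        cp = utf8_append(cp, next);
    }
    return cp;
}
```

A main function could call read_code_point() in a loop, print each result with a format such as "c = %05lX\n", and stop when it gets UTF8_END or UTF8_ERROR. For example, the three bytes E2 82 AC (the euro sign) decode to 0x20AC.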

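The second code sample in the question won't work: c is never given a value, so the if condition tests an indeterminate value. If that branch happens to be entered, the while loop never ends, because ch1 is read only once, before the loop. It doesn't check the bytes properly either, so that code won't help you much.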

Sami Kuhmonen
  • OK, I understood your answer about the lower code, but I am not sure I fully understand what you mean in your second paragraph. Can you explain it more, please? – navarian Dec 12 '15 at 14:20
  • @kostasdi Added an explanation of what to do – Sami Kuhmonen Dec 12 '15 at 14:25
  • This covers the basics: it will read valid UTF-8 and reject egregiously invalid UTF-8 (and may be sufficient for the OP). For full validation, there are some additional requirements, such as: rejecting non-minimal encodings (0xC0 0x80 is a non-minimal and hence invalid encoding for U+0000; the valid encoding is 0x00); UTF-16 surrogates (U+D800..U+DFFF) are not allowed; values outside the range U+0000..U+10FFFF are invalid. – Jonathan Leffler Dec 12 '15 at 14:43
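The three extra checks listed in this comment could be expressed as a small test applied after decoding. The sketch below assumes the decoder already has the decoded value and knows how many bytes the sequence used; both function names are hypothetical and chosen only for this illustration.

```c
/* Smallest code point that genuinely needs `nbytes` bytes in UTF-8,
   or -1 for a length outside 1..4. */
long utf8_min_for_length(int nbytes)
{
    switch (nbytes) {
    case 1:  return 0x0000L;
    case 2:  return 0x0080L;
    case 3:  return 0x0800L;
    case 4:  return 0x10000L;
    default: return -1L;
    }
}

/* Returns 1 if `cp`, decoded from a sequence of `nbytes` bytes,
   passes the three additional checks; returns 0 otherwise. */
int utf8_is_valid_code_point(long cp, int nbytes)
{
    long min = utf8_min_for_length(nbytes);

    if (min < 0)
        return 0;                        /* unsupported length */
    if (cp < min)
        return 0;                        /* non-minimal encoding */
    if (cp >= 0xD800L && cp <= 0xDFFFL)
        return 0;                        /* UTF-16 surrogate */
    if (cp > 0x10FFFFL)
        return 0;                        /* beyond the Unicode range */
    return 1;
}
```

With this check, the two-byte sequence 0xC0 0x80 is rejected because it decodes to 0, which is below the two-byte minimum of 0x80.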