I have to write a program that takes UTF-8 encoded characters and "translates" them into Unicode code points. You can read about UTF-8 here: https://en.wikipedia.org/wiki/UTF-8. I am a C beginner, so I have three restrictions placed on me:
- I must use getchar()
- It is forbidden to use arrays
- I am only interested in Unicode characters encoded in 1, 2, 3 or 4 bytes
So I have this code, which works for 4-byte characters (I know I must check != EOF for every getchar(), but for now that is not my problem):
#include <stdio.h>

int main(void) {
    int ch1, ch2, ch3, ch4, c;

    ch1 = getchar();
    ch2 = getchar();
    ch3 = getchar();
    ch4 = getchar();
    if ((ch1 & 0xF8) != 0xF0 || (ch2 & 0xC0) != 0x80 ||
        (ch3 & 0xC0) != 0x80 || (ch4 & 0xC0) != 0x80) {
        printf("Error in UTF-8 4-byte encoding\n");
        return 1;
    }
    c = ((ch1 & 0x07) << 18) | ((ch2 & 0x3F) << 12) |
        ((ch3 & 0x3F) << 6) | (ch4 & 0x3F);
    printf("c = %05X\n", c);
    return 0;
}
My question: I cannot understand how to use getchar() for 1-, 2- and 3-byte characters. Must I make all four getchar() calls at the beginning and then use only ch1 for 1-byte characters and ch1, ch2 for 2-byte characters, or must I do it like this? (By the way, the code below is not functional; it gives me an infinite loop. I only use it as an example of my thinking.)
#include <stdio.h>

int main (void) {
    int ch1, ch2, ch3, ch4, c;

    if (c >= 0x0000 && c <= 0x007F) {
        ch1 = getchar();
        while (ch1 != EOF) {
            if ((ch1 & 0x80) != 0x00) {
                printf("Error in UTF-8 1-byte encoding\n");
                return 1;
            }
            c = ((ch1 & 0x80) << 7);
            printf("c = %05X\n", c);
        }
    }
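
For comparison, here is a minimal sketch of the other approach: read one lead byte, work out from its high bits how many bytes the character has, and only then read that many continuation bytes. It uses only getchar() and no arrays. The helper names (utf8_length, utf8_lead_bits, utf8_append, read_code_point, decode_stdin) are made up for illustration, and the sketch assumes that rejecting overlong encodings and surrogates is out of scope. Three things in the attempt above look like they go wrong: c is tested before it has been given a value, ch1 is read once before the while and never read again inside it (hence the infinite loop), and (ch1 & 0x80) << 7 is always 0 once the check has passed.

```c
#include <stdio.h>

/* Number of bytes announced by a lead byte, or 0 if the byte cannot start
   a 1- to 4-byte sequence (for example a stray continuation byte). */
int utf8_length(int lead) {
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Payload bits carried by the lead byte of a sequence of length len. */
int utf8_lead_bits(int lead, int len) {
    switch (len) {
    case 1:  return lead & 0x7F;
    case 2:  return lead & 0x1F;
    case 3:  return lead & 0x0F;
    default: return lead & 0x07;
    }
}

/* Shift the value built so far and add the 6 payload bits of one
   continuation byte (10xxxxxx). */
int utf8_append(int acc, int cont) {
    return (acc << 6) | (cont & 0x3F);
}

/* Read one character from stdin.
   Returns the code point, -1 on a clean end of input, -2 on malformed input. */
int read_code_point(void) {
    int ch, len, c, i;

    ch = getchar();
    if (ch == EOF) return -1;
    len = utf8_length(ch);
    if (len == 0) return -2;
    c = utf8_lead_bits(ch, len);
    for (i = 1; i < len; i++) {
        ch = getchar();                      /* a fresh byte on every pass */
        if (ch == EOF || (ch & 0xC0) != 0x80) return -2;
        c = utf8_append(c, ch);
    }
    return c;
}

/* Decode everything on stdin, printing one code point per line.
   Returns 0 on success and 1 on malformed input. */
int decode_stdin(void) {
    int c;

    while ((c = read_code_point()) >= 0)
        printf("c = %05X\n", c);
    if (c == -2) {
        printf("Error in UTF-8 encoding\n");
        return 1;
    }
    return 0;
}
```

With this layout, main would contain nothing more than return decode_stdin();. The point of the design is that getchar() is called inside the loop, once per byte actually needed, so a 1-byte character never consumes the bytes of the character that follows it.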