
I am having trouble getting the decimal values of the bytes of a UTF-8 character and then converting them to binary (something like 12 = 0b1100). For example, how can I convert "ン" to its binary representation "11100011 10000011 10110011"?

I know that UTF-8 uses multiple bytes per character. I tried to print it out 8 bits at a time, from left to right. For ASCII I use the code below, but what can I use for UTF-8?

int c, i;
char *asc;   /* an array can't be assigned to, so use a pointer */

while ((c = getchar()) != EOF)
{
    asc = DecimalToBinary(c);
    for (i = 7; i >= 0; i--)
    {
        printf("%c", *(asc + i));
    }
}

char *DecimalToBinary (int num) {
    static char binary[8];
    int i;
    for (i = 0; i < 8; i++)   /* reset the buffer, or bits from a previous call leak through */
        binary[i] = '0';
    i = 0;
    while (num != 0) {
        if (num % 2 == 0)
        {
            binary[i++] = '0';
        }
        else {
            binary[i++] = '1';
        }
        num = num / 2;
    }
    return binary;
}
Remy Lebeau
Chenyu
  • Do you mean a *string* of zeros and ones? – Biffen Mar 29 '16 at 06:40
  • 3
    You take the first byte, get its highest bit and print it, followed by the next highest bit, etc. Then you take the second byte and do the same. – Some programmer dude Mar 29 '16 at 06:41
  • 1
    check the wikipedia article about utf-8, for example and implement the algorithm. @JoachimPileborg: He wants to convert into the unicode code point, not the binary representation of the utf-8 code. The "binary expression" is no utf-8. – ikrabbe Mar 29 '16 at 06:46
  • 3
    Please [read about how to ask good questions](http://stackoverflow.com/help/how-to-ask), and learn how to create a [Minimal, Complete, and Verifiable Example](http://stackoverflow.com/help/mcve). Right now it's unclear what you want. For example, create a MCVE and show us, together with the input and output from that program. – Some programmer dude Mar 29 '16 at 06:57
  • 1
    @Biffen it actually is. – n. m. could be an AI Mar 29 '16 at 06:59
  • Do you know how to read a multi-byte character into an array of `char`? Do you know how to print the binary representation of a byte? What have you tried? Where was the problem? It appears that the character is U+30F3 (0xE3 0x83 0xB3 in UTF-8), aka KATAKANA LETTER N. If you want the binary representation of U+30F3, then you have to know how to decode a start byte and two continuation bytes in UTF-8. It's difficult to believe there isn't already a plethora of similar questions on SO. – Jonathan Leffler Mar 29 '16 at 07:10
  • Sorry for the trouble, everyone; this is my first time asking a question. I am trying to modify my question now... – Chenyu Mar 29 '16 at 07:18
  • 1
    @Chenyu Please clarify whether you want the binary of the UTF-8 encoding of the character, or that of the codepoint. – Biffen Mar 29 '16 at 07:38
  • @Biffen Sorry for being misleading; I just want to get the decimal value of the UTF-8 and then convert it to binary (something like 12 = 1100) – Chenyu Mar 29 '16 at 07:45
  • 1
    @Chenyu What's ‘*decimal value of the UTF-8*’?! UTF-8 uses eight-bit ‘code units’, and `ン` needs three such units. There's nothing decimal, and no single numeric value. If `ン` should result in `"00110000 11110011"` then you're using the *codepoint*, which, in itself, has nothing to do with UTF-8. – Biffen Mar 29 '16 at 07:50
  • @Biffen I got your point! Yes, I just want to get the UTF-8 code point... – Chenyu Mar 29 '16 at 10:04
  • 1
    @Chenyu There's no such thing as a ‘*UTF-8 codepoint*’. It's *either* UTF-8-encoded data *or* Unicode codepoint. Which one? – Biffen Mar 29 '16 at 10:05
  • I print out "ンニチハ" byte by byte: 11100011 10000011 10110011 11100011 10000011 10001011 11100011 10000011 10000001 11100011 10000011 10001111 00001010. This is what I want. Are these Unicode codepoints? – Chenyu Mar 29 '16 at 10:33
  • @Chenyu No, that's UTF-8. But that's not what you showed in the question, where `ン` would become `00110000 11110011`. By your latest comment it should be `11100011 10000011 10110011`. – Biffen Mar 29 '16 at 10:50
  • It seems that I mixed up UTF-8 with Unicode... oh no... @Biffen Thanks! – Chenyu Mar 29 '16 at 11:16
  • @Chenyu Then it should be no different from ASCII. How does your code *not* work with UTF-8? – Biffen Mar 29 '16 at 11:27
  • When the input is ASCII, I use getchar() to get each letter and then convert it to binary form. But when faced with UTF-8 I can't use it anymore, so I changed to fgets() to read stdin. I have a problem with fgets(): how can I determine the length of the input string, so I can use a for loop to convert it byte by byte? I have changed my original question above. – Chenyu Mar 29 '16 at 12:26
  • @Chenyu You edited it into a completely different question. Don't do that. Post a new question instead. And if this question is somehow not valid anymore then delete it. – Biffen Mar 29 '16 at 13:16

1 Answer


If you need the binary representation of the UTF-8 encoding, just print each byte bit by bit.
If you need the binary representation of the character itself (its code point), convert it to UTF-32 first and then print that in binary.

See also:
UTF-8, UTF-16, and UTF-32
https://gist.github.com/antonijn/9009746
Conversion of Char to Binary in C

tvorez