
I am having trouble getting the decimal values of the bytes of a UTF-8 character and then converting them to binary (something like 12 = 0b1100). For example, how can I convert "ン" to its binary representation "11100011 10000011 10110011"?

I know that UTF-8 uses multiple bytes per character. I tried to print it out 8 bits at a time, from left to right. For ASCII I use the code below, but what can I use for UTF-8?

int c, i;
char *asc;   /* an array can't be assigned to, so use a pointer */

while ((c = getchar()) != EOF)
{
    asc = DecimalToBinary(c);
    for (i = 7; i >= 0; i--)
    {
        printf("%c", *(asc + i));
    }
}

char *DecimalToBinary (int num) {
    static char binary[8];
    int i;
    for (i = 0; i < 8; i++)   /* reset the buffer, or bits from a previous call leak through */
        binary[i] = '0';
    i = 0;
    while (num != 0) {
        if (num % 2 == 0)
        {
            binary[i++] = '0';
        }
        else {
            binary[i++] = '1';
        }
        num = num / 2;
    }
    return binary;
}
Remy Lebeau
Chenyu
  • Do you mean a *string* of zeros and ones? – Biffen Mar 29 '16 at 06:40
  • 3
    You take the first byte, get its highest bit and print it, followed by the next highest bit, etc. Then you take the second byte and do the same. – Some programmer dude Mar 29 '16 at 06:41
  • 1
    check the wikipedia article about utf-8, for example and implement the algorithm. @JoachimPileborg: He wants to convert into the unicode code point, not the binary representation of the utf-8 code. The "binary expression" is no utf-8. – ikrabbe Mar 29 '16 at 06:46
  • 3
    Please [read about how to ask good questions](http://stackoverflow.com/help/how-to-ask), and learn how to create a [Minimal, Complete, and Verifiable Example](http://stackoverflow.com/help/mcve). Right now it's unclear what you want. For example, create a MCVE and show us, together with the input and output from that program. – Some programmer dude Mar 29 '16 at 06:57
  • 1
    @Biffen it actually is. – n. m. could be an AI Mar 29 '16 at 06:59
  • Do you know how to read a multi-byte character into an array of `char`? Do you know how to print the binary representation of a byte? What have you tried? Where was the problem? It appears that the character is U+30F3 (0xE3 0x83 0xB3 in UTF-8), aka KATAKANA LETTER N. If you want the binary representation of U+30F3, then you have to know how to decode a start byte and two continuation bytes in UTF-8. It's difficult to believe there isn't already a plethora of similar questions on SO. – Jonathan Leffler Mar 29 '16 at 07:10
  • Sorry for the trouble, everyone; this is my first time asking a question. I am trying to modify my question now... – Chenyu Mar 29 '16 at 07:18
  • 1
    @Chenyu Please clarify whether you want the binary of the UTF-8 encoding of the character, or that of the codepoint. – Biffen Mar 29 '16 at 07:38
  • @Biffen Sorry for being misleading; I just want to get the decimal value of the UTF-8 and then convert it to binary (something like 12 = 1100) – Chenyu Mar 29 '16 at 07:45
  • 1
    @Chenyu What's ‘*decimal value of the UTF-8*’?! UTF-8 uses eight-bit ‘code units’, and `ン` needs three such units. There's nothing decimal, and no single numeric value. If `ン` should result in `"00110000 11110011"` then you're using the *codepoint*, which, in itself, has nothing to do with UTF-8. – Biffen Mar 29 '16 at 07:50
  • @Biffen I got your point! Yes, I just want to get the UTF-8 code point... – Chenyu Mar 29 '16 at 10:04
  • 1
    @Chenyu There's no such thing as a ‘*UTF-8 codepoint*’. It's *either* UTF-8-encoded data *or* Unicode codepoint. Which one? – Biffen Mar 29 '16 at 10:05
  • I print out "ンニチハ" byte by byte: 11100011 10000011 10110011 11100011 10000011 10001011 11100011 10000011 10000001 11100011 10000011 10001111 00001010. This is what I want. Are these Unicode codepoints? – Chenyu Mar 29 '16 at 10:33
  • @Chenyu No, that's UTF-8. But that's not what you showed in the question, where `ン` would become `00110000 11110011`. By your latest comment it should be `11100011 10000011 10110011`. – Biffen Mar 29 '16 at 10:50
  • It seems that I mixed up UTF-8 with Unicode... oh no... @Biffen Thanks! – Chenyu Mar 29 '16 at 11:16
  • @Chenyu Then it should be no different from ASCII. How does your code *not* work with UTF-8? – Biffen Mar 29 '16 at 11:27
  • When the input is ASCII, I use getchar() to get each letter and then convert it to binary form. But when faced with UTF-8 I can't use it anymore, so I changed to fgets() to read stdin. I have a problem with fgets(): how can I determine the length of the input string, so I can use a for loop to convert it byte by byte? I have changed my original question above. – Chenyu Mar 29 '16 at 12:26
  • @Chenyu You edited it into a completely different question. Don't do that. Post a new question instead. And if this question is somehow not valid anymore then delete it. – Biffen Mar 29 '16 at 13:16

1 Answer


If you need the binary representation of the UTF-8 encoding, just print each byte bit by bit.
If you need the binary representation of the character itself (its code point), convert it to UTF-32 first and then print that in binary.

See also:
UTF-8, UTF-16, and UTF-32
https://gist.github.com/antonijn/9009746
Conversion of Char to Binary in C

tvorez