4

I can't think of a way to remove the leading zeros. My goal is to loop over the code points in a for loop and create the UTF-8 and UTF-32 version of each number.

For example, with UTF-8 wouldn't I have to remove the leading zeros? Does anyone have a solution for how to pull this off? Basically what I am asking is: does someone have an easy solution for converting Unicode code points to UTF-8?

    for (i = 0x0; i < 0xffff; i++) {
        printf("%#x \n", i);
        //convert to UTF8
    }

So here is an example of what I am trying to accomplish for each i.

  • For example: the Unicode value U+0760 (base 16) would convert to UTF-8 as
    • in binary: 1101 1101 1010 0000
    • in hex: DD A0

Basically, what I am trying to do for every i is convert it to its hex equivalent in UTF-8.

The problem I am running into is that the process for converting Unicode to UTF-8 seems to involve removing leading 0s from the binary representation, and I am not really sure how to do that dynamically.

chqrlie
Joe Caraccio
  • When you say "of each number" do you mean that the integer 1234 will produce the string "1234" UTF-8 encoded? Or do you mean it will produce the character represented by 1234 in UTF-8? (Spoiler: there isn't one) Or do you mean the 1234th Unicode code point? – Schwern Feb 02 '17 at 21:28
  • @Schwern my goal was to convert all Unicode characters 0x0 to 0x10FFFF to UTF8 and UTF32 form.. – Joe Caraccio Feb 02 '17 at 21:30
  • 1
    That's still ambiguous. Could you give a solid example? And did you mean 0x0 to 0xFFFF? That's what your code is doing. – Schwern Feb 02 '17 at 21:30
  • so my first approach was to loop through from that to 10FFFF, the only problem is I am not sure how to actually convert to utf from a code standpoint – Joe Caraccio Feb 02 '17 at 21:32
  • There is detailed information on wikipedia which bits in the Unicode code point go into the bytes of a UTF-8 encoding. Is it that what you are looking for? – Jens Gustedt Feb 02 '17 at 21:34
  • Ok, so you are talking code points. When `i` is 03F4 you want ϴ, right? – Schwern Feb 02 '17 at 21:35
  • @Schwern , yeah, exactly! sorry if I was vague.. thats what I was trying to accomplish – Joe Caraccio Feb 02 '17 at 21:36
  • @JensGustedt under the Unicode page? I can't seem to find it – Joe Caraccio Feb 02 '17 at 21:36
  • 1
    There are many ways to convert code-points 0 to 10FFFF to a small sequence of bytes for UTF-8. Of course the [surrogates](http://unicode.org/glossary/#surrogate_code_point) do not convert. – chux - Reinstate Monica Feb 02 '17 at 21:37
  • 2
    https://en.wikipedia.org/wiki/UTF-8 – Jens Gustedt Feb 02 '17 at 21:38
  • i understand the conversion part.. my issue is accomplishing it, you drop the leading 0s and then assign it somehow. not sure how you would remove the leading 0s from the hex value – Joe Caraccio Feb 02 '17 at 21:42
  • Post an example of input and desired output for a single `i`. – chux - Reinstate Monica Feb 02 '17 at 21:43
  • @chux, sure thing... just did on the post above – Joe Caraccio Feb 02 '17 at 21:49
  • Note the fun parts like U+D800 .. U+DFFF are only valid as UTF-16 surrogates, so you shouldn't be trying to generate those as part of the BMP. – Jonathan Leffler Feb 02 '17 at 22:01
  • Formatting the UTF-32 is pretty straight-forward, isn't it? You take the value in `i` and format it in hex (`"U+%.4X"` or thereabouts). That generates `U+0064` or `U+0760` or `U+10FFFF`. If you don't want the `U+`, drop it from the format. Handling UTF-8 is fiddlier; you probably end up with separate cases for 1, 2, 3 and 4 byte sequences — noting that the BMP (U+0000 .. U+FFFF) can be encoded in 3 bytes of UTF-8. Since you've not shown which leading zeros you are getting that need to be removed, it is hard to know how to help you remove them. – Jonathan Leffler Feb 02 '17 at 22:05
  • You do not "remove the leading zeroes" from the hex value. You *ignore* them (after figuring out how many there are, so as to perform the proper conversion). This is fundamentally an arithmetic problem, and leading zeroes are arithmetically insignificant. – John Bollinger Feb 02 '17 at 22:26
  • @JonathanLeffler You're right about UTF-32, though make sure you check `i` is not out of range. But `U+XXXX` indicates a Unicode code point, not a character encoding. You wouldn't describe a character encoding like UTF-32 using that format, you'd just use the hex. – Schwern Feb 02 '17 at 23:22
  • @Schwern: my discussion of UTF-32 is skimpy — there could be issues with UTF-32BE vs UTF-32LE and such like issues. The `"U+%.4X"` format is perhaps most appropriate for identifying each line (entry). At some point, someone should point out that plane 15 (U+F0000..U+FFFFF) and plane 16 (U+100000..U+10FFFF) are private use areas (there are also much smaller private use areas in the BMP, aka plane 0), and that planes 1, 2 and 14 are used (Supplementary Multilingual Plane, Supplementary Ideographic Plane, and Supplementary Special-Purpose Plane) but planes 3 to 13 are currently unused. Etc. – Jonathan Leffler Feb 02 '17 at 23:33

3 Answers

12

As the Wikipedia UTF-8 page describes, each Unicode code point (0 through 0x10FFFF) is encoded in UTF-8 as a sequence of one to four bytes.

Here is a simple example function, edited from one of my earlier posts. I've now removed the U suffixes from the integer constants too. (Their intent was to remind the human programmer that the constants are explicitly unsigned for a reason (negative code points are not considered at all), and that the function assumes an unsigned int code. The compiler does not care, and probably because of that this practice seems odd and confusing even to long-term members here, so I give up and will stop trying to include such reminders. :( )

#include <stddef.h>  /* for size_t */

static size_t code_to_utf8(unsigned char *const buffer, const unsigned int code)
{
    if (code <= 0x7F) {
        buffer[0] = code;
        return 1;
    }
    if (code <= 0x7FF) {
        buffer[0] = 0xC0 | (code >> 6);            /* 110xxxxx */
        buffer[1] = 0x80 | (code & 0x3F);          /* 10xxxxxx */
        return 2;
    }
    if (code <= 0xFFFF) {
        buffer[0] = 0xE0 | (code >> 12);           /* 1110xxxx */
        buffer[1] = 0x80 | ((code >> 6) & 0x3F);   /* 10xxxxxx */
        buffer[2] = 0x80 | (code & 0x3F);          /* 10xxxxxx */
        return 3;
    }
    if (code <= 0x10FFFF) {
        buffer[0] = 0xF0 | (code >> 18);           /* 11110xxx */
        buffer[1] = 0x80 | ((code >> 12) & 0x3F);  /* 10xxxxxx */
        buffer[2] = 0x80 | ((code >> 6) & 0x3F);   /* 10xxxxxx */
        buffer[3] = 0x80 | (code & 0x3F);          /* 10xxxxxx */
        return 4;
    }
    return 0;
}

You supply it with an unsigned char array, four chars or larger, and the Unicode code point. The function returns how many chars were needed to encode the code point in UTF-8 and were assigned in the array. It returns 0 (nothing encoded) for codes above 0x10FFFF, but it does not otherwise check that the Unicode code point is valid. That is, it is a simple encoder, and all it knows about Unicode is that the code points are from 0 to 0x10FFFF, inclusive. It knows nothing about surrogate pairs, for example.

Note that because the code point is explicitly an unsigned integer, negative arguments will be converted to unsigned according to C rules.

You need to write a function that prints out the 8 least significant bits in each unsigned char (the C standard does allow larger char sizes, but UTF-8 only uses 8-bit units). Then, use the above function to convert a Unicode code point (0 to 0x10FFFF, inclusive) to its UTF-8 representation, and call your bit function for each unsigned char in the array, in increasing order, for the count of unsigned chars the above conversion function returned for that code point.
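For example, a minimal driver along those lines might look like the following (assuming code_to_utf8() above is in the same file; the print_bits() helper here is just one possible way to dump the bits). Note that this loop will also happily emit the surrogate code points U+D800 to U+DFFF, because the encoder above does not check for them.

#include <stdio.h>

/* Print the 8 least significant bits of one byte, most significant bit first. */
static void print_bits(const unsigned char byte)
{
    int bit;
    for (bit = 7; bit >= 0; bit--)
        putchar(((byte >> bit) & 1) ? '1' : '0');
}

int main(void)
{
    unsigned char buffer[4];
    unsigned int code;

    for (code = 0; code <= 0x10FFFF; code++) {
        size_t len = code_to_utf8(buffer, code);
        size_t i;

        if (len == 0)
            continue; /* only happens above 0x10FFFF, so never in this loop */

        printf("U+%04X:", code);
        for (i = 0; i < len; i++) {
            putchar(' ');
            print_bits(buffer[i]);
        }
        putchar('\n');
    }
    return 0;
}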

Nominal Animal
  • All these `U` suffixes on the integral literals are redundant and confusing. They are not needed for the `unsigned` argument type. Btw: I don't know why you did not use the suffix for `0xFFFF` and `0x10FFFF`, but removing them everywhere would IMHO improve readability. – chqrlie Feb 04 '17 at 14:44
  • @chqrlie: See comments [here](http://stackoverflow.com/a/42013397/1475978). If I remove them, I'll remove the "hint" (wrt. behaviour change if you switch the parameter to `int code`). Eff it, it's not worth the effort. Will edit. – Nominal Animal Feb 04 '17 at 15:20
  • indeed not worth it, chux updated his own answer and removed the `U` suffixes too. – chqrlie Feb 04 '17 at 15:48
  • 1
    @chqrlie: No, chux never had them. I do like them myself, because they tell me that there is a specific reason (although not what that reason is) for the unsignedness. I need to find a better way to include such hints, that certain modifications may include unexpected side effects that are intentionally avoided by the current form of the code. I do believe they help those who read and experiment with the code in order to learn from it. Also, such error-inducing changes are a form of "trap" to those who simply take code and modify it minimally to present as their own coursework. Suggestions? – Nominal Animal Feb 04 '17 at 18:42
  • (In any larger project or production code, I include such assumptions and choices in comment blocks before the implementation; interface choices and reasons for them in comment blocks before the declaration. Of course, there is no sense in any kind of "traps" then. I do try to keep a big difference between learning examples and actual production stuff.) – Nominal Animal Feb 04 '17 at 18:45
  • we seem to be in sync, you posted the follow up comment as I was writing that such considerations should be explained in comments. I do agree it is not a perfect solution, but it does allow for a more pedagogical approach as the explanation can be more explicit than any hint in the code. If the unsuspecting programmer ignores the comment, as many do, there is not much we can do. It takes years of C practice to acquire humility. – chqrlie Feb 04 '17 at 18:50
  • Agreed. I do have to admit, I find writing *good* comments even harder... Much, much harder than the actual code. – Nominal Animal Feb 04 '17 at 18:55
3

Converting to UTF-32 is trivial: it's just the Unicode code point.

#include <stdio.h>   /* for fprintf() */
#include <wchar.h>

wint_t codepoint_to_utf32( const wint_t codepoint ) {
    if( codepoint > 0x10FFFF ) {
        fprintf( stderr, "Codepoint %x is out of UTF-32 range\n", codepoint);
        return -1;
    }

    return codepoint;
}

Note that I'm using wint_t, w for "wide". That's an integer type which is guaranteed to be large enough to hold any wchar_t as well as WEOF. wchar_t (wide character) is guaranteed to be wide enough to support all system locales.

Converting to UTF-8 is a bit more complicated because its layout is designed to be compatible with 7-bit ASCII. Some bit shifting is required.

Start with the UTF-8 table.

U+0000  U+007F    0xxxxxxx
U+0080  U+07FF    110xxxxx  10xxxxxx
U+0800  U+FFFF    1110xxxx  10xxxxxx    10xxxxxx
U+10000 U+10FFFF  11110xxx  10xxxxxx    10xxxxxx    10xxxxxx

Turn that into a big if/else if statement.

wint_t codepoint_to_utf8( const wint_t codepoint ) {
    wint_t utf8 = 0;

    // U+0000   U+007F    0xxxxxxx
    if( codepoint <= 0x007F ) {
    }
    // U+0080   U+07FF    110xxxxx  10xxxxxx
    else if( codepoint <= 0x07FF ) {
    }
    // U+0800   U+FFFF    1110xxxx  10xxxxxx    10xxxxxx
    else if( codepoint <= 0xFFFF ) {
    }
    // U+10000  U+10FFFF  11110xxx  10xxxxxx    10xxxxxx    10xxxxxx
    else if( codepoint <= 0x10FFFF ) {
    }
    else {
        fprintf( stderr, "Codepoint %x is out of UTF-8 range\n", codepoint);
        return -1;
    }

    return utf8;
}

And start filling in the blanks. The first one is easy: it's just the code point.

    // U+0000   U+007F    0xxxxxxx
    if( codepoint <= 0x007F ) {
        utf8 = codepoint;
    }

To do the next one, we need to apply a bit mask and do some bit shifting. C doesn't support binary literals, so I converted the binary into hex using `perl -wle 'printf("%x\n", 0b1100000010000000)'`.

    // U+0080   U+07FF    110xxxxx  10xxxxxx
    else if( codepoint <= 0x00007FF ) {
        // Start at 1100000010000000
        utf8 = 0xC080;

        // 6 low bits using the bitmask 00111111
        // That fills in the 10xxxxxx part.
        utf8 += codepoint & 0x3f;

        // 5 high bits using the bitmask 11111000000
        // Shift over 2 to jump the hard coded 10 in the low byte.
        // That fills in the 110xxxxx part.
        utf8 += (codepoint & 0x7c0) << 2;
    }

I'll leave the rest to you.

We can test this with various interesting values that touch each piece of logic.

int main() {    
    // https://codepoints.net/U+0041
    printf("LATIN CAPITAL LETTER A: %x\n", codepoint_to_utf8(0x0041));
    // https://codepoints.net/U+00A2
    printf("Cent sign: %x\n", codepoint_to_utf8(0x00A2));
    // https://codepoints.net/U+2603
    printf("Snowman: %x\n", codepoint_to_utf8(0x02603));
    // https://codepoints.net/U+10160
    printf("GREEK ACROPHONIC TROEZENIAN TEN: %x\n", codepoint_to_utf8(0x10160));

    printf("Out of range: %x\n", codepoint_to_utf8(0x00200000));
}

This is an interesting exercise, but if you want to do this for real, use a pre-existing library. GNOME's GLib has Unicode manipulation functions, along with a lot of the other pieces missing from C.
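For instance, here is a minimal sketch using GLib's g_unichar_to_utf8(), assuming GLib is installed and you compile with the flags from `pkg-config --cflags --libs glib-2.0`:

#include <glib.h>
#include <stdio.h>

int main(void)
{
    gchar buf[6];   /* g_unichar_to_utf8() wants at least 6 bytes of space */
    gint len = g_unichar_to_utf8(0x0760, buf);
    gint i;

    /* Print the encoded bytes in hex. */
    for (i = 0; i < len; i++)
        printf("%02X ", (unsigned char)buf[i]);
    printf("\n");   /* prints: DD A0 */

    return 0;
}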

Schwern
  • 1
    The initial test `if (codepoint > 0x001FFFFF)` is incorrect. It should be `if (codepoint > 0x0010FFFF)` and if type `wint_t` is signed, an extra test for negative values is needed too. To make things worse, `wint_t` can have as few as 15 value bits... such a mess. – chqrlie Feb 04 '17 at 15:47
  • @chqrlie Thanks, I fat fingered the UTF-32 check. As for `wint_t`, by my reading it must be large enough to hold all locales on the system, and assuming the system has UTF-8 seems reasonable (I thought about putting in an assert, but it's 2017); is that not true? It does drag in negatives. `uint32_t` might be better, but it makes returning an error code difficult. – Schwern Feb 04 '17 at 20:52
  • You could use `int32_t`. A `wchar_t` is usually 16 bits wide on Windows systems, which is decidedly *not* enough to hold any Unicode Codepoint. – Smiley1000 Jan 05 '22 at 23:42
  • @Smiley1000 Microsoft violating the standard is pretty standard. I guess that's legacy when they went all in on UTF-16. I don't pretend to understand the mash of C and C++ in Microsoft compilers, but they claim you can use `char8_t` for UTF-8: https://learn.microsoft.com/en-us/cpp/cpp/char-wchar-t-char16-t-char32-t?view=msvc-170 and an interesting proposal to [backport char8_t to C](http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2653.htm). – Schwern Jan 06 '22 at 00:12
  • Right, you can use that. I think it's best to use the explicitly-sized character types (`char8_t`, `char16_t`, `char32_t`). – Smiley1000 Jan 09 '22 at 23:21
0

There are many ways to do this fun exercise of converting a code point to UTF-8.

So as not to give the whole coding experience away, the following is pseudo code to get OP started.

#define UTF_WIDTH1_MAX       0x7F
#define UTF_WIDTH2_MAX       0x7FF
#define UTF_WIDTH3_MAX       0xFFFF
#define UTF_WIDTH4_MAX       0x10FFFF

void PrintCodepointUTF8(uint32_t codepoint) {
  uint8_t first;
  uint8_t continuation_bytes[3];
  unsigned continuation_bytes_n;
  if (codepoint <= UTF_WIDTH1_MAX) {
    first = codepoint;
    continuation_bytes_n = 0;
  } else if (codepoint <= UTF_WIDTH2_MAX) {
    // extract 5 bits for first and 6 bits for one continuation_byte
    // and set some bits
    first = ...;
    continuation_bytes = ...
    continuation_bytes_n = 1;
  } else if (codepoint <= UTF_WIDTH3_MAX) {
    if (isasurrogate(codepoint)) fail.
    // else extract 4 bits for first and 6 bits for each continuation_byte
    // and set some bits
    first = ...;
    continuation_bytes = ...
    continuation_bytes_n = 2;
  } else if (codepoint <= UTF_WIDTH4_MAX) {
    // extract 3 bits for first and 6 bits for each continuation_byte
    // and set some bits
    first = ...;
    continuation_bytes = ...
    continuation_bytes_n = 3;
  } else {
    fail out of range.
  }
  print first and 0-3 continuation_bytes
}
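
The isasurrogate() test above is not a standard library function; a simple definition, based on the surrogate range U+D800 to U+DFFF mentioned in the comments, could look like this:

#include <stdbool.h>
#include <stdint.h>

// Surrogate code points U+D800..U+DFFF are reserved for UTF-16 pairs
// and must not be encoded in UTF-8.
static bool isasurrogate(uint32_t codepoint) {
  return codepoint >= 0xD800 && codepoint <= 0xDFFF;
}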
chux - Reinstate Monica
  • I really hope printing in binary is a/the key point here, because otherwise I just basically handed OP their code on a platter (in my answer; didn't see yours while I was writing mine). Hate that; prefer to help learn, not skip work. – Nominal Animal Feb 02 '17 at 22:27
  • @NominalAnimal Hmmm, a positive [different stroke for different folks](http://www.urbandictionary.com/define.php?term=different%20strokes%20for%20different%20folks). BTW: why decimal constant in `code < 1114112U`? Lots of `U`s too. Not needed, but not bad either. A surrogate test would be nice there too - now maybe thats too much for OP, maybe add next week. – chux - Reinstate Monica Feb 02 '17 at 22:57
  • Didn't notice it was decimal! :) Because I explicitly use an unsigned integer argument for the code point, I reject negative codepoints as too large (because they get converted to unsigned int according to C rules). The `U`s are there to raise the very question you posed; i.e. *"why are these constants explicitly unsigned?"*. I assume that those who blindly copy the code will just convert the `unsigned int`s to `int`s and drop the `U`s, which will make the code fail for negative codepoints. If I were testing others' code for this, I'd test a common bug: `EOF` or `WEOF`. – Nominal Animal Feb 03 '17 at 04:16