Converting to UTF-32 is trivial: it's just the Unicode code point.
#include <stdio.h>
#include <wchar.h>

wint_t codepoint_to_utf32( const wint_t codepoint ) {
    if( codepoint > 0x10FFFF ) {
        fprintf( stderr, "Codepoint %x is out of UTF-32 range\n", codepoint);
        // Error marker; the caller has to check for it.
        return -1;
    }
    return codepoint;
}
Note that I'm using wint_t, w for "wide". That's an integer type which is guaranteed to be large enough to hold any wchar_t as well as WEOF. wchar_t (wide character) is guaranteed to be wide enough to support all system locales.
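Because the width of wint_t depends on the platform (on systems where wchar_t is 16 bits, wint_t may be too narrow to hold U+10FFFF), a fixed-width type can be the safer choice. Here is a minimal sketch of the same range check using uint32_t from <stdint.h>. The name codepoint_to_utf32_checked and the out-parameter style are my own choices for illustration, not part of the code above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical variant of codepoint_to_utf32() using a fixed-width type.
// Returns true and stores the UTF-32 code unit in *out on success,
// false if the code point is above U+10FFFF.
// Like the original, it does not reject surrogates (U+D800 to U+DFFF).
bool codepoint_to_utf32_checked( uint32_t codepoint, uint32_t *out ) {
    assert( out != NULL );

    if( codepoint > 0x10FFFF ) {
        return false;
    }
    *out = codepoint;
    return true;
}
```

Returning a bool keeps the error signal separate from the result, so no value of the result type has to be reserved as an error marker.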
Converting to UTF-8 is a bit more complicated because its variable-length byte layout is designed to be compatible with 7-bit ASCII. Some bit shifting is required.
Start with the UTF-8 table.
First     Last      Byte 1    Byte 2    Byte 3    Byte 4
U+0000    U+007F    0xxxxxxx
U+0080    U+07FF    110xxxxx  10xxxxxx
U+0800    U+FFFF    1110xxxx  10xxxxxx  10xxxxxx
U+10000   U+10FFFF  11110xxx  10xxxxxx  10xxxxxx  10xxxxxx
Turn that into a big if/else if statement.
wint_t codepoint_to_utf8( const wint_t codepoint ) {
    wint_t utf8 = 0;

    // U+0000 U+007F 0xxxxxxx
    if( codepoint <= 0x007F ) {
    }
    // U+0080 U+07FF 110xxxxx 10xxxxxx
    else if( codepoint <= 0x07FF ) {
    }
    // U+0800 U+FFFF 1110xxxx 10xxxxxx 10xxxxxx
    else if( codepoint <= 0xFFFF ) {
    }
    // U+10000 U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if( codepoint <= 0x10FFFF ) {
    }
    else {
        fprintf( stderr, "Codepoint %x is out of UTF-8 range\n", codepoint);
        return -1;
    }

    return utf8;
}
Then start filling in the blanks. The first one is easy: it's just the code point.
    // U+0000 U+007F 0xxxxxxx
    if( codepoint <= 0x007F ) {
        utf8 = codepoint;
    }
To do the next one, we need to apply a bit mask and do some bit shifting. C before C23 doesn't have binary literals (some compilers, such as GCC, accept the 0b prefix as an extension), so I converted the binary into hex using perl -wle 'printf("%x\n", 0b1100000010000000)'.
    // U+0080 U+07FF 110xxxxx 10xxxxxx
    else if( codepoint <= 0x07FF ) {
        // Start at 1100000010000000
        utf8 = 0xC080;

        // 6 low bits using the bitmask 00111111
        // That fills in the 10xxxxxx part.
        utf8 += codepoint & 0x3f;

        // 5 high bits using the bitmask 11111000000
        // Shift over 2 to jump the hard coded 10 in the low byte.
        // That fills in the 110xxxxx part.
        utf8 += (codepoint & 0x7c0) << 2;
    }
I'll leave the rest to you.
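If you want to check your work, here is one possible way to fill in the remaining branches, following the same mask-and-shift pattern. It is a sketch, not the only way to do it: I switched to uint32_t so the four-byte case is guaranteed to fit, errors are reported through a bool return instead of -1, and the name codepoint_to_utf8_packed is made up for this example. Like the code above, it packs the bytes into one integer and does not reject surrogates (U+D800 to U+DFFF).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical fixed-width variant of codepoint_to_utf8().
// Packs the UTF-8 bytes into one integer, first byte in the highest
// used position. Returns false if codepoint is above U+10FFFF.
bool codepoint_to_utf8_packed( uint32_t codepoint, uint32_t *utf8 ) {
    assert( utf8 != NULL );

    // U+0000 U+007F 0xxxxxxx
    if( codepoint <= 0x007F ) {
        *utf8 = codepoint;
    }
    // U+0080 U+07FF 110xxxxx 10xxxxxx
    else if( codepoint <= 0x07FF ) {
        *utf8 = 0xC080;
        *utf8 += codepoint & 0x3F;              // low 6 bits
        *utf8 += (codepoint & 0x7C0) << 2;      // top 5 bits, jump one "10"
    }
    // U+0800 U+FFFF 1110xxxx 10xxxxxx 10xxxxxx
    else if( codepoint <= 0xFFFF ) {
        *utf8 = 0xE08080;
        *utf8 += codepoint & 0x3F;              // low 6 bits
        *utf8 += (codepoint & 0xFC0) << 2;      // middle 6 bits, jump one "10"
        *utf8 += (codepoint & 0xF000) << 4;     // top 4 bits, jump two "10"s
    }
    // U+10000 U+10FFFF 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    else if( codepoint <= 0x10FFFF ) {
        *utf8 = 0xF0808080;
        *utf8 += codepoint & 0x3F;              // low 6 bits
        *utf8 += (codepoint & 0xFC0) << 2;      // next 6 bits, jump one "10"
        *utf8 += (codepoint & 0x3F000) << 4;    // next 6 bits, jump two "10"s
        *utf8 += (codepoint & 0x1C0000) << 6;   // top 3 bits, jump three "10"s
    }
    else {
        return false;
    }
    return true;
}
```

Each extra continuation byte costs two more bits of shift, because every 10xxxxxx byte has two hard-coded bits to jump over. With this version U+2603 comes out as 0xE29883 and U+10160 as 0xF09085A0.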
We can test this with various interesting values that touch each piece of logic.
int main(void) {
    // https://codepoints.net/U+0041
    printf("LATIN CAPITAL LETTER A: %x\n", codepoint_to_utf8(0x0041));

    // https://codepoints.net/U+00A2
    printf("Cent sign: %x\n", codepoint_to_utf8(0x00A2));

    // https://codepoints.net/U+2603
    printf("Snowman: %x\n", codepoint_to_utf8(0x02603));

    // https://codepoints.net/U+10160
    printf("GREEK ACROPHONIC TROEZENIAN TEN: %x\n", codepoint_to_utf8(0x10160));

    printf("Out of range: %x\n", codepoint_to_utf8(0x00200000));
}
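The %x output shows the encoding packed into one integer, but UTF-8 in a file or on the wire is a sequence of bytes. Here is a small sketch of unpacking such a value into a byte buffer, first byte first. utf8_unpack is a hypothetical helper name, and it assumes the value was packed the way codepoint_to_utf8 does it and that the -1 error value has already been checked for.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical helper: splits a packed UTF-8 value such as 0xC2A2 into
// its bytes, first byte first. Writes up to 4 bytes to out and returns
// how many were written.
size_t utf8_unpack( uint32_t packed, unsigned char out[4] ) {
    assert( out != NULL );

    // The length is given by the position of the highest non-zero byte.
    size_t len = 1;
    if( packed > 0xFFFFFF )    len = 4;
    else if( packed > 0xFFFF ) len = 3;
    else if( packed > 0xFF )   len = 2;

    // Peel bytes off the low end, filling the buffer back to front.
    for( size_t i = len; i > 0; i-- ) {
        out[i - 1] = (unsigned char)( packed & 0xFF );
        packed >>= 8;
    }
    return len;
}
```

For the cent sign, 0xC2A2 unpacks to the two bytes C2 A2, which can then be written out with fwrite or putchar.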
This is an interesting exercise, but if you want to do this for real, use a pre-existing library. GLib (the GNOME utility library) has Unicode manipulation functions, along with a lot of other pieces missing from C.