How do you map a single UTF-8 character to its Unicode code point in C?
For example, `È` would be mapped to `00c8`.
Viewed 960 times
3 Answers
4
If your platform's `wchar_t` stores Unicode (if it's a 32-bit type, it probably does) and you have a UTF-8 locale, you can call `mbrtowc` (from C90 Amendment 1, also known as C95).
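    mbstate_t state;
    wchar_t wch;
    char s[] = "\303\210";              /* UTF-8 encoding of È (U+00C8) */
    size_t n;
    memset(&state, 0, sizeof(state));   /* start in the initial conversion state */
    setlocale(LC_CTYPE, "en_US.utf8");  /* error checking omitted */
    n = mbrtowc(&wch, s, strlen(s), &state);
    /* (size_t)-1 means an invalid sequence, (size_t)-2 an incomplete one */
    if (n < (size_t)-2) printf("%lx\n", (unsigned long)wch);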
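The snippet above leaves out error checking. A minimal sketch with the checks filled in might look like the following; the helper names `select_utf8_locale` and `first_code_point` are made up for illustration, and the list of locale names is an assumption about what is commonly installed. It relies on the standard macro `__STDC_ISO_10646__`, which an implementation defines when `wchar_t` values are ISO 10646 (Unicode) code points.

```c
#include <locale.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: try a few common UTF-8 locale names.
 * Returns 1 if one of them could be selected, 0 otherwise. */
static int select_utf8_locale(void)
{
    static const char *const names[] = { "C.UTF-8", "en_US.UTF-8", "en_US.utf8" };
    size_t i;

    for (i = 0; i < sizeof names / sizeof names[0]; i++)
        if (setlocale(LC_CTYPE, names[i]) != NULL)
            return 1;
    return 0;
}

/* Hypothetical helper: decode the first multibyte character of the
 * NUL-terminated string s using the current LC_CTYPE locale.
 * Returns the code point, or -1 if the sequence is invalid or if
 * wchar_t is not known to hold Unicode code points. */
static long first_code_point(const char *s)
{
#ifndef __STDC_ISO_10646__
    (void)s;
    return -1;                          /* wchar_t may not be Unicode here */
#else
    mbstate_t state;
    wchar_t wch = 0;
    size_t n;

    memset(&state, 0, sizeof state);    /* initial conversion state */
    /* +1 so the terminating NUL is visible and "" decodes to 0 */
    n = mbrtowc(&wch, s, strlen(s) + 1, &state);
    if (n == (size_t)-1 || n == (size_t)-2)
        return -1;                      /* invalid or incomplete sequence */
    return (long)wch;
#endif
}
```

Calling `select_utf8_locale()` once at startup and then `first_code_point("\303\210")` should yield `0xc8` on a system where one of those locales is installed and `wchar_t` holds Unicode.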
For more flexibility, you can call the iconv interface.
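    char s[] = "\303\210";
    /* iconv_open takes the target encoding first, then the source encoding */
    iconv_t cd = iconv_open("UCS-4", "UTF-8");
    if (cd != (iconv_t)-1) {
        char *inp = s;
        size_t ins = strlen(s);
        uint32_t c;
        char *outp = (char *)&c;    /* iconv works on char buffers */
        size_t outs = sizeof(c);    /* room for one UCS-4 character */
        /* "UCS-4" is often big-endian; use "UCS-4LE" or "UCS-4BE" to pin the byte order */
        if (iconv(cd, &inp, &ins, &outp, &outs) != (size_t)-1)
            printf("%lx\n", (unsigned long)c);
        iconv_close(cd);
    }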

Gilles 'SO- stop being evil'
- Thanks for your post. I probably want `iconv_t cd = iconv_open("UTF-8", "UCS-2");`. – Ken May 24 '11 at 09:58
- You're right -- "UCS-*" is mentioned in the `iconv` header document. I missed that (I tried lots of other combinations). Your answer is exactly what I needed, thanks. – Ken May 24 '11 at 10:08
2
Some things to look at:
- libiconv
- ConvertUTF.h
- MultiByteToWideChar (on Windows)

siukurnin
- Thanks for your post. I'm a little confused - I thought `iconv` was for changing the encoding, not for mapping to representations as strings. Perhaps I'm missing something glaring. – Ken May 24 '11 at 09:07
- @SK9: a string's encoding is the way its characters are represented, so I'm confused as to where you're confused. – Gilles 'SO- stop being evil' May 24 '11 at 09:56
0
A reasonably fast implementation of a UTF-8 to UCS-2 converter. Surrogates and characters outside the BMP are left as an exercise.
The function returns the number of bytes consumed from the input string `s`. A negative value represents an error.
The resulting Unicode character is stored at the address `p` points to.
    int utf8_to_wchar(wchar_t *p, const char *s)
    {
        /* Work on unsigned bytes so values >= 0x80 are not sign-extended. */
        const unsigned char *us = (const unsigned char *)s;

        p[0] = 0;
        if (!*us)
            return 0;                   /* end of string */
        else if (us[0] < 0x80) {
            /* 1 byte: 0xxxxxxx */
            p[0] = us[0];
            return 1;
        }
        else if (((us[0] & 0xE0) == 0xC0) && (us[1] & 0xC0) == 0x80) {
            /* 2 bytes: 110xxxxx 10xxxxxx */
            p[0] = ((us[0] & 0x1F) << 6) | (us[1] & 0x3F);
    #ifdef DETECT_OVERLONG
            if (p[0] < 0x80) return -2;
    #endif
            return 2;
        }
        else if (((us[0] & 0xF0) == 0xE0) && (us[1] & 0xC0) == 0x80 && (us[2] & 0xC0) == 0x80) {
            /* 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx */
            p[0] = ((us[0] & 0x0F) << 12) | ((us[1] & 0x3F) << 6) | (us[2] & 0x3F);
    #ifdef DETECT_OVERLONG
            if (p[0] < 0x800) return -2;
    #endif
            return 3;
        }
        return -1;                      /* invalid or unsupported sequence */
    }
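For the part left as an exercise, here is one possible sketch of a stricter decoder. It is not the code from this answer: the name `utf8_decode` and the choice of `uint32_t` for the result are assumptions made for illustration. It accepts four-byte sequences and always rejects overlong forms, surrogates (U+D800 to U+DFFF), and values above U+10FFFF.

```c
#include <stdint.h>

/* Hypothetical function: decode one UTF-8 sequence from the NUL-terminated
 * string s into *cp. Returns the number of bytes consumed (0 at the end of
 * the string) or -1 for an invalid sequence. */
int utf8_decode(uint32_t *cp, const char *s)
{
    const unsigned char *us = (const unsigned char *)s;
    uint32_t c, min;
    int len, i;

    *cp = 0;
    if (us[0] == 0)
        return 0;
    if (us[0] < 0x80) {
        *cp = us[0];
        return 1;
    }

    /* The lead byte gives the length and the smallest value that
     * legitimately needs that many bytes. */
    if ((us[0] & 0xE0) == 0xC0)      { len = 2; min = 0x80;    c = us[0] & 0x1F; }
    else if ((us[0] & 0xF0) == 0xE0) { len = 3; min = 0x800;   c = us[0] & 0x0F; }
    else if ((us[0] & 0xF8) == 0xF0) { len = 4; min = 0x10000; c = us[0] & 0x07; }
    else
        return -1;                  /* stray continuation or invalid lead byte */

    for (i = 1; i < len; i++) {
        /* A NUL terminator fails this test too, so we never read past it. */
        if ((us[i] & 0xC0) != 0x80)
            return -1;
        c = (c << 6) | (uint32_t)(us[i] & 0x3F);
    }

    if (c < min)
        return -1;                  /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF)
        return -1;                  /* UTF-16 surrogate, not a character */
    if (c > 0x10FFFF)
        return -1;                  /* beyond the Unicode range */

    *cp = c;
    return len;
}
```

Storing the result in a `uint32_t` rather than a `wchar_t` sidesteps the question of how wide `wchar_t` is on a given platform; a caller that needs UCS-2 can reject any result above 0xFFFF.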

Patrick Schlüter
- I think you got the three-byte case wrong. You need to treat `s` as an `unsigned char*` all along, but even then, I get `0xe201c` for `“` (\U{201c}). – Gilles 'SO- stop being evil' May 24 '11 at 09:56
- Wow, I have now understood the 0xe201c and why it didn't happen on our implementation. We have a 16-bit wchar_t, so the `0xe2 << 12` overflows the 16-bit word and the result is 0x2000, not 0xe2000. – Patrick Schlüter May 24 '11 at 11:37
- -1 for dangerous code that accepts invalid sequences. Don't roll your own unless you know what you're doing! – R.. GitHub STOP HELPING ICE May 24 '11 at 14:11
- Which invalid sequences? It doesn't support 4-byte sequences and it doesn't check for the surrogate space, but that was said in the text above. Implementing your own (limited) version is sometimes necessary where compilers are outdated and the system doesn't provide libraries (embedded systems). – Patrick Schlüter May 24 '11 at 19:34
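The `0xe201c` value discussed in the comments can be reproduced with a small sketch. The two helper names are hypothetical; the first omits the mask on the lead byte, which appears to be what the earlier revision of the answer did, and the second applies the `0x0F` mask as the code shown in the answer does.

```c
#include <stdint.h>

/* Hypothetical helper: combine three UTF-8 bytes WITHOUT masking the
 * lead byte, so its 1110 marker bits leak into the result. */
uint32_t combine3_unmasked(unsigned char b0, unsigned char b1, unsigned char b2)
{
    return ((uint32_t)b0 << 12)
         | ((uint32_t)(b1 & 0x3F) << 6)
         |  (uint32_t)(b2 & 0x3F);
}

/* Hypothetical helper: the same combination with the lead byte masked
 * down to its four payload bits. */
uint32_t combine3_masked(unsigned char b0, unsigned char b1, unsigned char b2)
{
    return ((uint32_t)(b0 & 0x0F) << 12)
         | ((uint32_t)(b1 & 0x3F) << 6)
         |  (uint32_t)(b2 & 0x3F);
}
```

For `“` (bytes `E2 80 9C`), the unmasked variant gives `0xe201c` in a 32-bit result, while truncating that value to 16 bits, as a 16-bit `wchar_t` would, gives `0x201c`, the same as the masked variant. That is why the missing mask went unnoticed on a platform with a 16-bit `wchar_t`.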