Mapping multibyte characters to their unicode point representation

Question

How do you map a single UTF-8 character to its Unicode code point in C? [For example, È would be mapped to 00c8].

score 4 · Accepted Answer · answered May 24 '11 at 09:41

If your platform's wchar_t stores Unicode code points (if it's a 32-bit type, it probably does) and you have a UTF-8 locale, you can call mbrtowc (from C90 Amendment 1, also known as C95).

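/* needs <locale.h>, <stdio.h>, <string.h>, <wchar.h> */
mbstate_t state;
wchar_t wch;
char s[] = "\303\210"; /* UTF-8 encoding of U+00C8 */
size_t n;
memset(&state, 0, sizeof(state)); /* initial conversion state */
setlocale(LC_CTYPE, "en_US.utf8"); /*error checking omitted*/
n = mbrtowc(&wch, s, strlen(s), &state);
/* (size_t)-1 is an invalid sequence, (size_t)-2 an incomplete one; wch is not set in either case */
if (n < (size_t)-2) printf("%lx\n", (unsigned long)wch);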

For more flexibility, you can call the iconv interface.

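/* needs <iconv.h>, <stdint.h>, <stdio.h>, <string.h> */
char s[] = "\303\210";
iconv_t cd = iconv_open("UCS-4", "UTF-8"); /* arguments are (tocode, fromcode) */
if (cd != (iconv_t)-1) {
    char *inp = s;
    size_t ins = strlen(s);
    uint32_t c;
    char *outp = (char *)&c;
    size_t outs = sizeof(c); /* bytes available in the output buffer */
    /* Plain "UCS-4" may be big-endian regardless of the host (it is in glibc);
       an explicit "UCS-4LE" or "UCS-4BE" avoids the ambiguity. */
    if (iconv(cd, &inp, &ins, &outp, &outs) != (size_t)-1) printf("%lx\n", (unsigned long)c);
    iconv_close(cd);
}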

Thanks for your post. I probably want `iconv_t cd = iconv_open("UTF-8", "UCS-2");`. — Ken, May 24 '11 at 09:58
You're right -- "UCS-*" is mentioned in the `iconv` header document. I missed that (I tried lots of other combinations). Your answer is exactly what needed, thanks. — Ken, May 24 '11 at 10:08

score 2 · Answer 2 · answered May 24 '11 at 08:48

Some things to look at:

libiconv
ConvertUTF.h
MultiByteToWideChar (under windows)

siukurnin

Thanks for your post. I'm a little confused - I thought `iconv` was for changing the encoding, not for mapping to representations as strings. Perhaps I'm missing something glaring. – Ken May 24 '11 at 09:07
@SK9: a string's encoding is the way its characters are represented, so I'm confused as to where you're confused. – Gilles 'SO- stop being evil' May 24 '11 at 09:56
I hadn't realised "UCS-2" was available. – Ken May 24 '11 at 12:05

Patrick Schlüter · Answer 3 · 2011-05-24T19:36:36.597

0

A reasonably fast implementation of a UTF-8 to UCS-2 converter. Surrogates and characters outside the BMP are left as an exercise. The function returns the number of bytes consumed from the input string s. A negative value represents an error. The resulting Unicode character is stored at the address p points to.

int utf8_to_wchar(wchar_t *p, const char *s)
{
    const unsigned char *us = (const unsigned char *)s;

    p[0] = 0;
    if (!*us)
        return 0;
    if (us[0] < 0x80) {
        /* one byte: 0xxxxxxx */
        p[0] = us[0];
        return 1;
    }
    if ((us[0] & 0xE0) == 0xC0 && (us[1] & 0xC0) == 0x80) {
        /* two bytes: 110xxxxx 10xxxxxx */
        p[0] = ((us[0] & 0x1F) << 6) | (us[1] & 0x3F);
#ifdef DETECT_OVERLONG
        if (p[0] < 0x80) return -2;
#endif
        return 2;
    }
    if ((us[0] & 0xF0) == 0xE0 && (us[1] & 0xC0) == 0x80 && (us[2] & 0xC0) == 0x80) {
        /* three bytes: 1110xxxx 10xxxxxx 10xxxxxx */
        p[0] = ((us[0] & 0x0F) << 12) | ((us[1] & 0x3F) << 6) | (us[2] & 0x3F);
#ifdef DETECT_OVERLONG
        if (p[0] < 0x800) return -2;
#endif
        return 3;
    }
    return -1; /* invalid, truncated, or unsupported (four-byte) sequence */
}

edited May 24 '11 at 19:36

answered May 24 '11 at 09:35

I think you got the three-byte case wrong. You need to treat `s` as an `unsigned char*` all along, but even then, I get `0xe201c` for `“` (\U{201c}). – Gilles 'SO- stop being evil' May 24 '11 at 09:56
Wow, I have now understood the 0xe201c and why it didn't happen on our implementation. We have a 16-bit wchar_t, so the `0xe2 << 12` overflows the 16-bit word and the result is 0x2000, not 0xe2000. – Patrick Schlüter May 24 '11 at 11:37
-1 for dangerous code that accepts invalid sequences. Don't roll your own unless you know what you're doing! – R.. GitHub STOP HELPING ICE May 24 '11 at 14:11
Which invalid sequences? It doesn't support 4-byte sequences and it doesn't check for the surrogate space, but that was said in the text above. Implementing your own (limited) version is sometimes necessary where compilers are outdated and the system doesn't provide libraries (embedded systems). – Patrick Schlüter May 24 '11 at 19:34
