
In C++ on Windows, how do you convert an XML character reference of the form `&#xhhhh;` to a UTF-16 little-endian string?

I'm thinking that if the hhhh part is 4 hex digits or fewer, then it's 2 bytes, which fits into one UTF-16 character. But this wiki page has a table of character references, and some near the bottom are 5-digit hex numbers which won't fit into two bytes. How can those be converted to UTF-16?

I'm wondering if the `MultiByteToWideChar` function is capable of doing the job.

My understanding of how a code point that's bigger than 2 bytes gets converted to UTF-16 is lacking! (Or, for that matter, I'm not too sure how a code point that's bigger than 1 byte gets converted to UTF-8, but that's another question.)

Thanks.

Scott Langham
    `MultiByteToWideChar` is totally inappropriate for this task. – Mark Ransom Mar 17 '21 at 19:37
  • Related: [MultiByteToWideChar for Unicode code pages 1200, 1201, 12000, 12001](https://stackoverflow.com/questions/29054217/multibytetowidechar-for-unicode-code-pages-1200-1201-12000-12001). – dxiv Mar 17 '21 at 19:48
    The algorithm to convert a codepoint into UTF-16 is described on Wikipedia, see [UTF-16](https://en.wikipedia.org/wiki/UTF-16) – Remy Lebeau Mar 17 '21 at 20:02
  • @RemyLebeau but the bigger problem in this question is to convert each string `hhhh;` to a codepoint in the first place. Once you've done that your advice might be helpful. – Mark Ransom Mar 19 '21 at 04:15
  • @MarkRansom it is trivial to parse XML character references into numeric codepoint values. Especially if you use an actual XML parser and let it do the work for you – Remy Lebeau Mar 19 '21 at 04:20
  • @RemyLebeau maybe so, but funny that nobody mentioned it earlier. Seems like an essential part of the question to me. – Mark Ransom Mar 19 '21 at 04:33
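
As the comments note, the first step is turning the hexadecimal digits of the reference into a numeric code point. A minimal sketch of that parsing step, assuming a well-formed hexadecimal reference such as `&#x1F600;` (the helper name is illustrative, not from the question):

#include <cstddef>
#include <string>

// Parses a hexadecimal XML character reference such as "&#x1F600;" into a
// Unicode code point. Returns false if the string is not of that form.
bool parse_hex_char_ref(const std::string& ref, char32_t& codepoint)
{
    if (ref.size() < 5 || ref.compare(0, 3, "&#x") != 0 || ref.back() != ';')
        return false;

    char32_t value = 0;
    for (std::size_t i = 3; i + 1 < ref.size(); ++i)
    {
        const char c = ref[i];
        value <<= 4;
        if (c >= '0' && c <= '9')      value += c - '0';
        else if (c >= 'a' && c <= 'f') value += c - 'a' + 10;
        else if (c >= 'A' && c <= 'F') value += c - 'A' + 10;
        else                           return false;
    }

    codepoint = value;
    return true;
}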

1 Answer


Unicode code points (UTF-32) are 4 bytes wide and can be converted into a UTF-16 character (and a possible surrogate) using the following code (that I happen to have lying around).

It is not heavily tested, so bug reports are gratefully accepted:

#include <array>

/**
 * Converts a UTF-32 code point to UTF-16 (and optional surrogate)
 * @param utf32 - UTF-32 code point
 * @param utf16 - returned UTF-16 character
 * @return - The number of code units in the UTF-16 character (1 or 2).
 */
unsigned utf32_to_utf16(char32_t utf32, std::array<char16_t, 2>& utf16)
{
    if(utf32 < 0xD800 || (utf32 > 0xDFFF && utf32 < 0x10000))
    {
        utf16[0] = char16_t(utf32);
        utf16[1] = 0;
        return 1;
    }

    utf32 -= 0x010000;

    utf16[0] = char16_t(((0b1111'1111'1100'0000'0000 & utf32) >> 10) + 0xD800);
    utf16[1] = char16_t(((0b0000'0000'0011'1111'1111 & utf32) >> 00) + 0xDC00);

    return 2;
}
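
For example, a 5-digit code point such as U+1F600 (`&#x1F600;`) takes the surrogate-pair branch and comes out as the code units 0xD83D 0xDE00, which are the bytes `3D D8 00 DE` in UTF-16LE. A minimal usage sketch, with an illustrative wrapper that puts the code units into a std::u16string (stored little-endian in memory on Windows/x86):

#include <array>
#include <string>

// Builds a UTF-16 string from a single code point using utf32_to_utf16() above.
std::u16string codepoint_to_utf16(char32_t codepoint)
{
    std::array<char16_t, 2> units{};
    const unsigned count = utf32_to_utf16(codepoint, units);
    return std::u16string(units.data(), count);
}

// Example: codepoint_to_utf16(0x1F600) yields { 0xD83D, 0xDE00 }.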
Galik
  • You might consider treating the range 0xd800 to 0xdfff specially, since those might be malformed input. – Mark Ransom Mar 17 '21 at 19:45
  • @MarkRansom Yes, I was wondering about the lack of error checking (I wrote this ages ago). But looking again at the Wikipedia article it says that, even though the range is technically bad code-points a lot of software allows them anyway... so I am going to have to mull on that for a bit. – Galik Mar 17 '21 at 20:05
    It might not be malformed input either, if the codepoints are paired to make a valid UTF-16 character. JSON is encoded this way for example, see e.g. [Why does JSON encode UTF-16 surrogate pairs instead of Unicode code points directly?](https://stackoverflow.com/q/38463038/5987) – Mark Ransom Mar 17 '21 at 20:49
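
One possible way to add the validation discussed in these comments (rejecting lone surrogate code points and values beyond U+10FFFF before calling the function) is a check along these lines; it's a sketch only, since whether lone surrogates should be rejected or passed through is, as noted, a policy decision:

// Returns true if the value is a Unicode scalar value that UTF-16 can encode,
// i.e. not a surrogate code point and not beyond U+10FFFF.
bool is_encodable_codepoint(char32_t utf32)
{
    return utf32 <= 0x10FFFF && !(utf32 >= 0xD800 && utf32 <= 0xDFFF);
}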