
How to convert back and forth between a Unicode/UCS codepoint and a UTF-16 surrogate pair in C++14 and later?

EDIT: Removed mention of UCS-2 surrogates, as there is no such thing. Thanks @remy-lebeau!

jotik

2 Answers


The tag info page explains the algorithm for converting a codepoint to a surrogate pair (more clearly than the Unicode Standard 9.0 specifies it in §3.9, Table 3-5) as follows:

Unicode characters outside the Basic Multilingual Plane, that is, characters with code points above 0xFFFF, are encoded in UTF-16 by pairs of 16-bit code units called surrogate pairs, by the following scheme:

  • 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF;
  • the top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first code unit or high surrogate, which will be in the range 0xD800..0xDBFF;
  • the low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.

In C++14 and later this could be written as:

#include <cstdint>

using codepoint = std::uint32_t;
using utf16 = std::uint16_t;

struct surrogate {
    utf16 high; // Leading
    utf16 low;  // Trailing
};

constexpr surrogate split(codepoint const in) noexcept {
    auto const inMinus0x10000 = (in - 0x10000);
    surrogate const r{
            static_cast<utf16>((inMinus0x10000 / 0x400) + 0xd800), // High
            static_cast<utf16>((inMinus0x10000 % 0x400) + 0xdc00)}; // Low
    return r;
}
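
Since 0x400 is 2^10, the division and modulo above map directly onto the shift-and-mask operations described in the quoted scheme. An equivalent formulation (splitBitwise is just an illustrative name, not part of the code above):

constexpr surrogate splitBitwise(codepoint const in) noexcept {
    auto const u = in - 0x10000;                       // 20-bit value
    return {static_cast<utf16>(0xd800 + (u >> 10)),    // Top ten bits
            static_cast<utf16>(0xdc00 + (u & 0x3ff))}; // Low ten bits
}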

In the reverse direction one just has to combine the last 10 bits from the high surrogate and the last 10 bits from the low surrogate, and add 0x10000:

constexpr codepoint combine(surrogate const s) noexcept {
    return static_cast<codepoint>(
            ((s.high - 0xd800) * 0x400) + (s.low - 0xdc00) + 0x10000);
}
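
As a concrete check, take U+1F600: 0x1f600 - 0x10000 = 0xf600, whose top ten bits are 0x3d and low ten bits are 0x200, giving the pair 0xd83d/0xde00. Since both functions are constexpr, the round trip can be verified at compile time:

static_assert(split(0x1f600).high == 0xd83d, "high surrogate of U+1F600");
static_assert(split(0x1f600).low == 0xde00, "low surrogate of U+1F600");
static_assert(combine(split(0x1f600)) == 0x1f600, "split/combine round trip");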

Here's a test for these conversions:

#include <cassert>

constexpr bool isValidUtf16Surrogate(utf16 v) noexcept
{ return (v & 0xf800) == 0xd800; }

constexpr bool isValidCodePoint(codepoint v) noexcept {
    return (v <= 0x10ffff)
        && ((v >= 0x10000) || !isValidUtf16Surrogate(static_cast<utf16>(v)));
}

constexpr bool isValidUtf16HighSurrogate(utf16 v) noexcept
{ return (v & 0xfc00) == 0xd800; }

constexpr bool isValidUtf16LowSurrogate(utf16 v) noexcept
{ return (v & 0xfc00) == 0xdc00; }

constexpr bool codePointNeedsUtf16Surrogates(codepoint v) noexcept
{ return (v >= 0x10000) && (v <= 0x10ffff); }

void test(codepoint const in) {
    assert(isValidCodePoint(in));
    assert(codePointNeedsUtf16Surrogates(in));
    auto const s = split(in);
    assert(isValidUtf16HighSurrogate(s.high));
    assert(isValidUtf16LowSurrogate(s.low));
    auto const out = combine(s);
    assert(isValidCodePoint(out));
    assert(in == out);
}

int main() {
    for (codepoint c = 0x10000; c <= 0x10ffff; ++c)
        test(c);
}
jotik

In C++11 and later, you can use std::wstring_convert to convert between various UTF/UCS encodings, using the following std::codecvt types:

  • std::codecvt_utf8
  • std::codecvt_utf16
  • std::codecvt_utf8_utf16

You don't need to handle surrogates manually.

You can use std::u32string to hold your codepoint(s), and std::u16string to hold your UTF-16/UCS-2 code units.

For example:

#include <codecvt>
#include <locale>
#include <string>

// std::codecvt_utf16 produces big-endian bytes by default; std::little_endian
// is specified here so that the reinterpret_casts below match the native
// char16_t byte order on (common) little-endian platforms.
using convert_utf16_utf32 = std::wstring_convert<
    std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>, char32_t>;

std::u16string CodepointToUTF16(const char32_t codepoint)
{
    std::string bytes = convert_utf16_utf32{}.to_bytes(codepoint);
    return std::u16string(
        reinterpret_cast<const char16_t*>(bytes.data()),
        bytes.size() / sizeof(char16_t)
    );
}

std::u16string UTF32toUTF16(const std::u32string &str)
{
    std::string bytes = convert_utf16_utf32{}.to_bytes(str);
    return std::u16string(
        reinterpret_cast<const char16_t*>(bytes.data()),
        bytes.size() / sizeof(char16_t)
    );
}

char32_t UTF16toCodepoint(const std::u16string &str)
{
    const char16_t *p = str.data();
    std::u32string utf32 = convert_utf16_utf32{}.from_bytes(
        reinterpret_cast<const char*>(p),
        reinterpret_cast<const char*>(p + str.size())
    );
    return utf32.empty() ? char32_t{0} : utf32[0];
}

std::u32string UTF16toUTF32(const std::u16string &str)
{
    const char16_t *p = str.data();
    return convert_utf16_utf32{}.from_bytes(
        reinterpret_cast<const char*>(p),
        reinterpret_cast<const char*>(p + str.size())
    );
}
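
For instance, a quick round-trip check of these functions on U+1F600 (a minimal sketch; the assertions just restate the surrogate values derived in the first answer):

#include <cassert>

int main()
{
    const char32_t cp = U'\U0001F600'; // GRINNING FACE
    const std::u16string utf16 = CodepointToUTF16(cp);
    assert(utf16.size() == 2);
    assert(utf16[0] == 0xd83d && utf16[1] == 0xde00);
    assert(UTF16toCodepoint(utf16) == cp);
    assert(UTF32toUTF16(U"\U0001F600") == utf16);
}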
Remy Lebeau
  • The example to convert strings is helpful, but please provide the essential example to convert single codepoints as well. – jotik Mar 18 '17 at 09:23
  • Additionally, according to cppreference.com the `std::codecvt_`-prefixed classes seem to be deprecated in C++17. Any comment on that? – jotik Mar 18 '17 at 09:42
  • @jotik I updated my answer with single-codepoint examples. I have no idea why they are deprecated in C++17 or what they are being replaced with. – Remy Lebeau Mar 18 '17 at 16:05
  • @jotik from [std::wstring_convert and std::codecvt_utf8 deprecated](https://www.google.com/amp/s/amp.reddit.com/r/cpp_questions/comments/5yo0el/stdwstring_convert_and_stdcodecvt_utf8_deprecated/): "*In the history of the cppreference.com page there's a note that **"p0618r0 deprecated codecvt". This paper is not publicly available so we don't know what it says**.*" – Remy Lebeau Mar 18 '17 at 22:03
  • It's available now – Cubbi Mar 23 '17 at 20:36
  • @Cubbi: I see that now (http://open-std.org/JTC1/SC22/WG21/docs/papers/2017/p0618r0.html), but it still does not describe what `<codecvt>`, `std::wstring_convert`, and `std::wbuffer_convert` are being replaced with, if anything, only that they are to be deprecated (and apparently the [committee did adopt p0618r0](https://isocpp.org/blog/2017/03/2017-03-post-kona-mailing-available): "*Adopted 2017-03*"). – Remy Lebeau Mar 23 '17 at 20:56
  • @RemyLebeau they aren't being replaced, they are being deprecated to encourage future replacement (and I think it's begging to repeat std::strstream's fate) – Cubbi Mar 23 '17 at 20:59