I need to convert a Unicode codepoint like "2460" to a string "①".

I've done my own research and found a library named ICU but I can't install it and get it to work.

I also know that "\u" is a thing, but my compiler doesn't allow me to do

string tmp = "2460";
cout << "\u" + tmp;

What can I do? My C++ understanding is pretty basic so please don't give me explanations that are too complicated.

  • Did you try `string tmp = "\u2460";`? – Galik Aug 27 '21 at 23:29
  • you also need to set your terminal/OS encoding to utf8 – bolov Aug 27 '21 at 23:45
  • @Galik I know that would work, but I need to read string from file and convert it to Unicode character. I have a string variable for the code, and I need to *convert* it to Unicode. – Volensia Volenski Aug 28 '21 at 01:30
  • Step 1, convert the string to a number which would be the Unicode codepoint. Step 2, convert the codepoint to UTF-8 - see for example [UTF8 to/from wide char conversion in STL](https://stackoverflow.com/q/148403/5987). – Mark Ransom Aug 28 '21 at 02:59

1 Answer

This can be done using the Standard Library, but it is not the most obvious or easy functionality. It is further complicated by the fact that the Standard Library changed the way this works between the C++11 and C++20 standards.

Here are two functions that use the Standard Library to convert between a Unicode Codepoint (char32_t) and a UTF-8 string (one for each version of The C++ Standard).

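inline
std::string cpp11_codepoint_to_utf8(char32_t cp) // C++11 Standard
{
    // Needs <codecvt>, <cwchar>, <stdexcept> and <string>.
    char utf8[4]; // one code point needs at most 4 UTF-8 bytes
    char* end_of_utf8;

    char32_t const* from = &cp;

    std::mbstate_t mbs{}; // zero-initialized conversion state
    std::codecvt_utf8<char32_t> ccv;

    // out() returns std::codecvt_base::ok (zero) on success
    if(ccv.out(mbs, from, from + 1, from, utf8, utf8 + 4, end_of_utf8))
        throw std::runtime_error("bad conversion");

    return {utf8, end_of_utf8};
}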

inline
std::string cpp20_codepoint_to_utf8(char32_t cp) // C++20 Standard
{
    // Needs <cwchar>, <locale>, <stdexcept> and <string>.
    using codecvt_32_8_type = std::codecvt<char32_t, char8_t, std::mbstate_t>;

    // std::codecvt has a protected destructor, so derive a local type
    struct codecvt_utf8
    : public codecvt_32_8_type
        { codecvt_utf8(std::size_t refs = 0): codecvt_32_8_type(refs) {} };

    char8_t utf8[4]; // one code point needs at most 4 UTF-8 bytes
    char8_t* end_of_utf8;

    char32_t const* from = &cp;

    std::mbstate_t mbs{}; // zero-initialized conversion state
    codecvt_utf8 ccv;

    // out() returns std::codecvt_base::ok (zero) on success
    if(ccv.out(mbs, from, from + 1, from, utf8, utf8 + 4, end_of_utf8))
        throw std::runtime_error("bad conversion");

    return {reinterpret_cast<char*>(utf8), reinterpret_cast<char*>(end_of_utf8)};
}

Neither function has been heavily tested.
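If `<codecvt>` is unavailable (`std::codecvt_utf8` has been deprecated since C++17), the same conversion can be written by hand, and the result can also serve as a cross-check for the two functions above. The following is only a minimal sketch: `encode_utf8` is a hypothetical name, not something from the Standard Library, and it assumes the goal is a plain `std::string` of UTF-8 bytes.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper (an assumption, not part of the Standard Library):
// encodes one Unicode code point as UTF-8 by hand, without <codecvt>.
std::string encode_utf8(char32_t cp)
{
    // Surrogates (U+D800..U+DFFF) and values above U+10FFFF are not
    // valid Unicode scalar values, so they are rejected.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        throw std::invalid_argument("not a valid Unicode code point");

    std::string out;
    if (cp < 0x80) {               // 1 byte:  0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {       // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {     // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                       // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For U+2460 this yields the three bytes E2 91 A0, which a UTF-8 terminal displays as ①.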

As for converting the string representation of the Unicode code point ("2460") to the integer stored in a char32_t, there are many ways to do this. Just remember that the number is written in hexadecimal (base 16).

You can use something like this for example:

std::string tmp = "2460";
// parse the text as a base-16 number; std::stoul needs <string>
char32_t u32 = static_cast<char32_t>(std::stoul(tmp, nullptr, 16));
tmp = cpp11_codepoint_to_utf8(u32);

std::cout << tmp << '\n';
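
Since the code point text is read from a file, it may be worth rejecting malformed input before converting it. The sketch below shows one way to do that with `std::stoul`; `parse_codepoint` is a hypothetical helper name, and the choice of which inputs to reject is an assumption rather than a requirement from the question.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (an assumption for illustration): parses text such
// as "2460" as a hexadecimal Unicode code point and rejects bad input.
char32_t parse_codepoint(const std::string& text)
{
    std::size_t used = 0;
    // std::stoul throws std::invalid_argument if no digits can be parsed
    // and std::out_of_range if the value does not fit in unsigned long.
    unsigned long value = std::stoul(text, &used, 16);

    // Reject input such as "246G", where only a prefix was parsed.
    if (used != text.size())
        throw std::invalid_argument("trailing characters after hex digits");

    // Reject surrogates and values beyond the Unicode range.
    if (value > 0x10FFFF || (value >= 0xD800 && value <= 0xDFFF))
        throw std::out_of_range("not a valid Unicode code point");

    return static_cast<char32_t>(value);
}
```

The result can be passed straight to `cpp11_codepoint_to_utf8`. Note that `std::stoul` with base 16 also accepts leading whitespace and an optional `0x` prefix, so `"0x2460"` parses to the same value as `"2460"`.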
Galik