1

Assuming that I have

uint32_t a(3084);

I would like to create a string that stores the Unicode character U+3084, which means that I should take the value of a and use it as the code point that selects the right character in the UTF-8 table/charset.

Now, clearly std::to_string() doesn't work for me. There are a lot of functions in the standard to convert between numeric values and char, but I can't find anything that gives me UTF-8 support and outputs an std::string.

I would like to ask whether I have to create this function from scratch or whether there is something in the C++11 standard that can help me with that; please note that my compiler (gcc/g++ 4.8.1) doesn't offer complete support for codecvt.

Yu Hao
user2485710
  • Note, the integer 3084 would actually correspond to the character U+0C0C, since Unicode codepoint numbers are expressed in hexadecimal. – Wyzard Nov 14 '13 at 03:06
  • @Wyzard and that's just part of the problem. Anyway, I would be happy with `char` instead of `std::string` as the output, too. – user2485710 Nov 14 '13 at 03:11
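
To make the decimal/hexadecimal point from the comments concrete, here is a minimal sketch. `codepoint_label` is a hypothetical helper (it is not part of the standard library) that formats a value in the usual U+XXXX notation, which shows that 3084 and 0x3084 name different characters.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: formats a code point as "U+XXXX"
// (uppercase hex, zero-padded to at least four digits).
std::string codepoint_label(std::uint32_t cp)
{
    static const char digits[] = "0123456789ABCDEF";
    std::string hex;
    do {
        // Prepend the lowest hex digit, then move to the next one.
        hex.insert(hex.begin(), digits[cp & 0xF]);
        cp >>= 4;
    } while (cp != 0);
    while (hex.size() < 4)
        hex.insert(hex.begin(), '0');
    return "U+" + hex;
}
```

With this, `codepoint_label(3084)` gives "U+0C0C" while `codepoint_label(0x3084)` gives "U+3084", so to get U+3084 the variable would need to be initialized as `uint32_t a(0x3084);`.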

4 Answers

11

Here's some C++ code that wouldn't be hard to convert to C. Adapted from an older answer.

#include <string>

// Encodes a single Unicode code point as a UTF-8 byte sequence.
std::string UnicodeToUTF8(unsigned int codepoint)
{
    std::string out;

    if (codepoint <= 0x7f)          // 1 byte:  0xxxxxxx
        out.append(1, static_cast<char>(codepoint));
    else if (codepoint <= 0x7ff)    // 2 bytes: 110xxxxx 10xxxxxx
    {
        out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else if (codepoint <= 0xffff)   // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
    {
        out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else                            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    {
        out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    return out;
}
Mark Ransom
  • it's perfect just the way it is, I need this for C++ 11 non C, thanks. – user2485710 Nov 14 '13 at 03:36
  • @user2485710 the question was originally tagged C along with C++, that's why I mentioned it. I see you removed that tag. – Mark Ransom Nov 14 '13 at 03:46
  • Note that the 4-byte encoding of UTF-8 can only handle codepoints up to U+1FFFFF, but `unsigned int` can go up to 0xffffffff, so you might want to add a check for that. – Remy Lebeau Nov 15 '13 at 16:32
  • @RemyLebeau, if you're going to worry about invalid Unicode there's a lot more to worry about than an out-of-bounds codepoint. I'm a firm believer in Garbage In, Garbage Out. That said this code already strips off the upper bits if present. – Mark Ransom Nov 15 '13 at 16:41
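
Following up on the range discussion in these comments, one possible way to add such a check is sketched below. The names `is_scalar_value` and `encode_utf8_checked` are hypothetical, and substituting U+FFFD (REPLACEMENT CHARACTER) for invalid input is only one policy, assumed here for illustration; throwing or returning an empty string would work just as well.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: true if cp is a Unicode scalar value, i.e. at most
// U+10FFFF and not in the surrogate range U+D800..U+DFFF.
bool is_scalar_value(std::uint32_t cp)
{
    return cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}

// Hypothetical checked encoder: same bit layout as the answer's code, but
// invalid input is replaced by U+FFFD first (an assumed policy).
std::string encode_utf8_checked(std::uint32_t cp)
{
    if (!is_scalar_value(cp))
        cp = 0xFFFD;

    std::string out;
    if (cp <= 0x7F) {
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

Because the value is validated up front, the masks on the leading byte are no longer needed: after the check, `cp >> 18` can never exceed 0x04.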
7

std::wstring_convert::to_bytes has a single-char overload just for you.

#include <cstdint>   // uint32_t
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
#include <iomanip>

// utility function for output
void hex_print(const std::string& s)
{
    std::cout << std::hex << std::setfill('0');
    for(unsigned char c : s)
        std::cout << std::setw(2) << static_cast<int>(c) << ' ';
    std::cout << std::dec << '\n';
}

int main()
{
    uint32_t a(3084);

    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv1;
    std::string u8str = conv1.to_bytes(a);
    std::cout << "UTF-8 conversion produced " << u8str.size() << " bytes:\n";
    hex_print(u8str);
}

I get (with libc++)

$ ./test
UTF-8 conversion produced 3 bytes:
e0 b0 8c 
Cubbi
1

The C++ standard contains the std::codecvt<char32_t, char, mbstate_t> facet, which converts between UTF-32 and UTF-8 according to 22.4.1.4 [locale.codecvt] paragraph 3. Sadly, the std::codecvt<...> facets aren't easy to use. At some point there was discussion about filtering stream buffers which would take care of the code conversion (the standard C++ library needs to implement them anyway for std::basic_filebuf<...>), but I can't see any trace of these.

Dietmar Kühl
  • Is there nothing for just an `unsigned` to `char` conversion? I have noticed that UTF support is the part of the C++ standard that just floats randomly among different implementations; it's probably the only part that is "complex" to handle, because it is poorly implemented and poorly executed. – user2485710 Nov 14 '13 at 03:31
  • [std::wstring_convert](http://en.cppreference.com/w/cpp/locale/wstring_convert/to_bytes) isn't all that hard to use... – Cubbi Nov 14 '13 at 03:34
  • @Cubbi in theory yes, in practice, it's just not implemented yet for my compiler/standard library. – user2485710 Nov 14 '13 at 03:37
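
For what it's worth, here is a rough sketch of what using the facet directly might look like. It assumes a standard library that actually provides std::codecvt<char32_t, char, std::mbstate_t> in the classic locale (which, per the comments, the asker's GCC 4.8.1 setup does not), and note that this specialization is deprecated as of C++20. `encode_with_codecvt` is a made-up name, and returning an empty string on failure is an assumed convention.

```cpp
#include <cwchar>   // std::mbstate_t
#include <locale>
#include <string>

// Hypothetical helper: converts one UTF-32 code unit to UTF-8 through the
// standard codecvt facet. Returns an empty string if the conversion fails.
std::string encode_with_codecvt(char32_t cp)
{
    typedef std::codecvt<char32_t, char, std::mbstate_t> facet_type;
    const facet_type& facet =
        std::use_facet<facet_type>(std::locale::classic());

    std::mbstate_t state = std::mbstate_t();  // initial conversion state
    char buffer[4];                           // UTF-8 needs at most 4 bytes
    const char32_t* from_next = nullptr;
    char* to_next = nullptr;

    const std::codecvt_base::result res =
        facet.out(state, &cp, &cp + 1, from_next,
                  buffer, buffer + sizeof buffer, to_next);

    if (res != std::codecvt_base::ok)
        return std::string();
    return std::string(buffer, to_next);      // bytes actually written
}
```

The seven-argument `out` call is a fair illustration of why these facets are considered awkward compared with a hand-written encoder or std::wstring_convert.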
0
auto s = u8"\343\202\204"; // Octal escaped representation of HIRAGANA LETTER YA
std::cout << s << std::endl;

prints

や

for me (using g++ 4.8.1). s has type const char*, as you'd expect, but I don't know if this is implementation-defined. Unfortunately, C++ doesn't have any support for manipulating UTF-8 strings as far as I know; for that you need to use a library like Glib::ustring.
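
On the question of the literal's type: in C++11 through C++17 a u8 literal is an array of const char, but since C++20 it is an array of const char8_t, and streaming that to std::cout no longer compiles. A small sketch that sidesteps the difference by spelling out the bytes in an ordinary narrow literal (`hiragana_ya` is a hypothetical function name):

```cpp
#include <string>

// Returns the UTF-8 encoding of U+3084 (HIRAGANA LETTER YA).
// An ordinary narrow literal with hex escapes is an array of const char
// in every C++ standard, so it always converts to std::string.
std::string hiragana_ya()
{
    return "\xE3\x82\x84";  // same three bytes as the octal form "\343\202\204"
}
```

Whether the terminal then displays the character correctly still depends on it expecting UTF-8 input.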

Tristan Brindle