Unsigned integer as UTF-8 value

Question

Assuming that I have

uint32_t a(3084);

I would like to create a string that stores the Unicode character U+3084, which means I should take the value of a and use it as the index of the corresponding character in the UTF-8 table/charset.

Clearly std::to_string() doesn't work for this. There are a lot of functions in the standard for converting between numeric values and char, but I can't find anything that supports UTF-8 and outputs a std::string.

I would like to ask whether I have to write this function from scratch or whether something in the C++11 standard can help; please note that my compiler (gcc/g++ 4.8.1) doesn't offer complete support for codecvt.

Note, the integer 3084 would actually correspond to the character U+0C0C, since Unicode codepoint numbers are expressed in hexadecimal. — Wyzard, Nov 14 '13 at 03:06
@Wyzard and that's just part of the problem, anyway I will be happy with `char` too instead of `std::string` as an output. — user2485710, Nov 14 '13 at 03:11
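
To make the decimal/hexadecimal point from the comments concrete, here is a minimal sketch that formats a value in the usual U+XXXX notation. The helper name CodepointLabel is hypothetical and not part of the question.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: formats a value in "U+XXXX" hexadecimal notation,
// padded to at least four hex digits.
std::string CodepointLabel(std::uint32_t value)
{
    char buf[16]; // large enough for "U+FFFFFFFF" plus the terminator
    std::snprintf(buf, sizeof buf, "U+%04lX", static_cast<unsigned long>(value));
    return std::string(buf);
}
```

With this, CodepointLabel(3084) yields U+0C0C, while the character the question is after needs the hexadecimal literal: CodepointLabel(0x3084) yields U+3084.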

score 11 · Accepted Answer · edited May 23 '17 at 11:53

Here's some C++ code that wouldn't be hard to convert to C. Adapted from an older answer.

#include <string>

// Encodes a single Unicode code point as a UTF-8 byte sequence (1 to 4 bytes).
std::string UnicodeToUTF8(unsigned int codepoint)
{
    std::string out;

    if (codepoint <= 0x7f)
        out.append(1, static_cast<char>(codepoint));
    else if (codepoint <= 0x7ff)
    {
        out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else if (codepoint <= 0xffff)
    {
        out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    else
    {
        out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
        out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
    }
    return out;
}

answered Nov 14 '13 at 03:30

Mark Ransom

It's perfect just the way it is; I need this for C++11, not C. Thanks. – user2485710 Nov 14 '13 at 03:36
@user2485710 the question was originally tagged C along with C++, that's why I mentioned it. I see you removed that tag. – Mark Ransom Nov 14 '13 at 03:46
Note that the 4-byte encoding of UTF-8 can only handle codepoints up to U+1FFFFF, but `unsigned int` can go up to 0xffffffff, so you might want to add a check for that. – Remy Lebeau Nov 15 '13 at 16:32
@RemyLebeau, if you're going to worry about invalid Unicode there's a lot more to worry about than an out-of-bounds codepoint. I'm a firm believer in Garbage In, Garbage Out. That said this code already strips off the upper bits if present. – Mark Ransom Nov 15 '13 at 16:41
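
For readers who do want the range check discussed in these comments, the following is a minimal sketch of a checked variant. It assumes that rejecting invalid input is preferable to encoding it; the name EncodeUtf8Checked and the bool-plus-out-parameter interface are hypothetical choices, not part of the answer above. It rejects values above U+10FFFF (the top of the Unicode range) and the surrogate range U+D800 to U+DFFF, which UTF-8 does not encode.

```cpp
#include <cstdint>
#include <string>

// Hypothetical checked encoder: returns false and leaves `out` empty when
// `cp` is not a Unicode scalar value; otherwise writes its UTF-8 bytes.
bool EncodeUtf8Checked(std::uint32_t cp, std::string& out)
{
    out.clear();
    if (cp > 0x10FFFF)
        return false;                      // beyond the Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;                      // UTF-16 surrogates are not encodable

    if (cp <= 0x7F) {
        out += static_cast<char>(cp);                          // 0xxxxxxx
    } else if (cp <= 0x7FF) {
        out += static_cast<char>(0xC0 | (cp >> 6));            // 110xxxxx
        out += static_cast<char>(0x80 | (cp & 0x3F));          // 10xxxxxx
    } else if (cp <= 0xFFFF) {
        out += static_cast<char>(0xE0 | (cp >> 12));           // 1110xxxx
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));           // 11110xxx
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return true;
}
```

Because the range is validated up front, the lead bytes no longer need masking. Whether to report failure with a bool, an exception, or a substituted U+FFFD is a design decision left to the caller.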

score 7 · Answer 2 · answered Nov 14 '13 at 03:38

std::wstring_convert::to_bytes has a single-char overload just for you.

#include <cstdint>
#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
#include <iomanip>

// utility function for output
void hex_print(const std::string& s)
{
    std::cout << std::hex << std::setfill('0');
    for(unsigned char c : s)
        std::cout << std::setw(2) << static_cast<int>(c) << ' ';
    std::cout << std::dec << '\n';
}

int main()
{
    uint32_t a(3084);

    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> conv1;
    std::string u8str = conv1.to_bytes(a);
    std::cout << "UTF-8 conversion produced " << u8str.size() << " bytes:\n";
    hex_print(u8str);
}

I get (with libc++)

$ ./test
UTF-8 conversion produced 3 bytes:
e0 b0 8c

This is the first example of `codecvt` I've ever seen work perfectly. Thanks! — Qix - MONICA WAS MISTREATED, Feb 02 '17 at 18:52

score 1 · Answer 3 · answered Nov 14 '13 at 03:21

The C++ standard contains the std::codecvt<char32_t, char, mbstate_t> facet, which converts between UTF-32 and UTF-8 according to 22.4.1.4 [locale.codecvt] paragraph 3. Sadly, the std::codecvt<...> facets aren't easy to use. At some point there was discussion about filtering stream buffers which would take care of the code conversion (the standard C++ library needs to implement them anyway for std::basic_filebuf<...>), but I can't see any trace of these.
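
To illustrate what using that facet directly involves, here is a rough sketch for a single code point. It assumes the standard library actually ships the facet in the classic locale, which the question suggests was not yet the case for gcc 4.8.1, and note that this specialization was later deprecated in C++20. The function name ToUtf8WithFacet is hypothetical.

```cpp
#include <cwchar>   // std::mbstate_t
#include <locale>
#include <string>

// Hypothetical helper: converts one UTF-32 code unit to UTF-8 through the
// std::codecvt<char32_t, char, std::mbstate_t> facet of the classic locale.
// Returns an empty string if the facet reports anything other than `ok`.
std::string ToUtf8WithFacet(char32_t cp)
{
    typedef std::codecvt<char32_t, char, std::mbstate_t> Facet;
    const Facet& facet = std::use_facet<Facet>(std::locale::classic());

    std::mbstate_t state = std::mbstate_t(); // initial conversion state
    char buf[4];                             // UTF-8 needs at most 4 bytes
    const char32_t* from_next = 0;
    char* to_next = 0;

    std::codecvt_base::result r =
        facet.out(state, &cp, &cp + 1, from_next, buf, buf + 4, to_next);
    if (r != std::codecvt_base::ok)
        return std::string();
    return std::string(buf, to_next);
}
```

The amount of bookkeeping (state object, begin/end/next pointers for both buffers, result code) is what makes the raw facet interface awkward compared with a hand-written encoder.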

Dietmar Kühl

Is there nothing for just an `unsigned` to `char` conversion? I have noticed that UTF support is the part of the C++ standard that just floats randomly among different implementations; it's probably the only part that is "complex" to handle because it is poorly implemented and poorly executed. – user2485710 Nov 14 '13 at 03:31
[std::wstring_convert](http://en.cppreference.com/w/cpp/locale/wstring_convert/to_bytes) isn't all that hard to use... – Cubbi Nov 14 '13 at 03:34
@Cubbi in theory yes, in practice, it's just not implemented yet for my compiler/standard library. – user2485710 Nov 14 '13 at 03:37

score 0 · Answer 4 · answered Nov 14 '13 at 03:19

auto s = u8"\343\202\204"; // Octal escaped representation of HIRAGANA LETTER YA
std::cout << s << std::endl;

prints

や

for me (using g++ 4.8.1). s has type const char*, as you'd expect, but I don't know if this is implementation-defined. Unfortunately, C++ doesn't have any support for manipulating UTF-8 strings as far as I know; for that you need to use a library like Glib::ustring.
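
As a variation on the literal above, the same bytes can be written with a universal character name instead of octal escapes. This is only a sketch: the helper name HiraganaYa is hypothetical, and the reinterpret_cast is there on the assumption that the code may also be compiled as C++20, where u8 literals are arrays of char8_t rather than char.

```cpp
#include <string>

// Hypothetical helper: returns HIRAGANA LETTER YA (U+3084) as UTF-8 bytes.
std::string HiraganaYa()
{
    // \u3084 is a universal character name; inside a u8 literal the compiler
    // stores it as the UTF-8 sequence e3 82 84. The cast is a no-op in C++11
    // and converts from const char8_t* in C++20.
    return std::string(reinterpret_cast<const char*>(u8"\u3084"));
}
```

Like the octal version, this is resolved at compile time, so it only helps when the code point is known in advance.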

Tristan Brindle

Thanks, but this is the easy case, at compile time; I need this at runtime. – user2485710 Nov 14 '13 at 03:20
