
I searched a lot, but couldn't find anything:

unsigned int unicodeChar = 0x5e9;
unsigned int utf8Char;
uni2utf8(unicodeChar, utf8Char);
assert(utf8Char == 0xd7a9);

Is there a library (preferably boost) that implements something similar to uni2utf8?

dda
Ezra
  • For the new c++11 unicode string literals see http://stackoverflow.com/questions/6796157/unicode-encoding-for-string-literals-in-c11 – Martin Beckett Jul 22 '12 at 19:50
  • 2
    What you're asking for does not make sense and cannot work. There is no such thing as a UTF-8 character. There are UTF-8 *code units*, which are 8-bit values that when properly decoded form a Unicode codepoint. But UTF-8 code units are *not* stored in `unsigned int`s of 32-bits in size. Each code unit is 8 bits in size; therefore, the way to store a Unicode codepoint in UTF-8 is as a sequence of code units. A *string*, not an integer. – Nicol Bolas Jul 22 '12 at 20:16
  • 1. UTF8 is unicode 2. use nowide. – Pavel Radzivilovsky Jul 23 '12 at 20:56
  • utf8 is not Unicode, utf8 is a method for representing numbers. unicode on the other hand is a mapping between symbols to numbers. Abstract numbers, not their representation. – Ezra Jan 28 '14 at 20:29

4 Answers


Unicode conversions are part of C++11:

#include <codecvt>
#include <locale>
#include <string>
#include <cassert>

int main() {
  std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
  std::string utf8 = convert.to_bytes(0x5e9);
  assert(utf8.length() == 2);
  assert(utf8[0] == '\xD7');
  assert(utf8[1] == '\xA9');
}
Philipp
  • is there a boost equivalent? (for those who can't code c++11) – Ezra Jul 22 '12 at 20:05
  • @Ezra Yes, there is Boost.Locale, I've added another answer for that. – Philipp Jul 22 '12 at 20:18
  • 3
    You don't need codecvt_utf8. `codecvt` converts between UTF-32 and UTF-8, and `codecvt` converts between UTF-16 and UTF-8. – bames53 Jul 22 '12 at 23:50
  • @bames53: I strongly suspect that works only if `char` is natively UTF-8. E.g. Linux, but not Windows. – MSalters Jul 23 '12 at 07:58
  • @MSalters No, the standard mandates that those codecvt specialzations must use UTF-8 as the 'external' encoding and UTF-32/UTF-16 as the 'internal' encodings. – bames53 Jul 23 '12 at 08:22
  • 1
    @bames53 Three reasons to prefer `codecvt_utf8` (at least in conjunction with `wstring_convert`): 1. It contains the word `utf8`, so it's clearer to the reader what's happening. 2. It's shorter (fewer template arguments required). 3. `codecvt` has a protected destructor and is therefore not usable as a drop-in replacement for `codecvt_utf8`. If you're using `wstring_convert`, you need C++11 anyway, so so always have `codecvt_utf8` at your disposal. I don't see much value in using `codecvt` here. – Philipp Jul 23 '12 at 18:45

Boost.Locale also has functions for encoding conversions:

#include <boost/locale.hpp>
#include <cassert>

int main() {
  unsigned int point = 0x5e9;
  std::string utf8 = boost::locale::conv::utf_to_utf<char>(&point, &point + 1);
  assert(utf8.length() == 2);
  assert(utf8[0] == '\xD7');
  assert(utf8[1] == '\xA9');
}
Philipp

You might want to give the UTF8-CPP library a try. Encoding a Unicode character with it would look like this:

std::wstring unicodeChar(L"\u05e9");
std::string utf8Char;
encode_utf8(unicodeChar, utf8Char);

std::string is used here just as a container for UTF-8 bytes.

Desmond Hume
  • 1
    Doesn't this assume that your `unicodeChar` is encoded in UTF-32? As far as I know, "wide strings" in C and C++ have an unspecified, opaque "system encoding" that could be anything. You'd first need to convert your wide string to UTF-32 using something like `iconv`. – Kerrek SB Jul 22 '12 at 21:49
  • @KerrekSB Do you see me using raw C wide strings alone or in conjunction with platform-specific implementation of `std::wstring`? – Desmond Hume Jul 22 '12 at 21:54
  • @KerrekSB Did I forget to "cook" that raw wide string with `std::wstring`, which knows full well how such strings should be handled on the current platform/compiler? – Desmond Hume Jul 22 '12 at 22:08
  • What do you think `wstring` is? It's just a container of `wchar_t`s, and you initialize those from a bog-standard wide string literal. Where's the "cooking"? – Kerrek SB Jul 22 '12 at 22:26
  • This code indeed won't work on Windows, where `wchar_t` is UCS-2/UTF-16 (16 bits, at least) and therefore cannot convert `U+10000` to UTF-8 – MSalters Jul 23 '12 at 08:01
  • @Msalters I searched the UTF8-CPP library website for a function `encode_utf8()` but didn't find it mentioned anywhere, so I'm not sure what it's specified to do. However, if it's taking a wstring then on Windows it should be expecting UTF-16 input. If so `encode_utf8(L"\U00010000",utf8Char)` would work just fine for converting U+10000 to UTF-8. – bames53 Jul 23 '12 at 19:18

Use sprintf. (:

char cstring[256];
sprintf(cstring, "%S", unicodestring);

iDomo