
I searched a lot, but couldn't find anything:

unsigned int unicodeChar = 0x5e9;
unsigned int utf8Char;
uni2utf8(unicodeChar, utf8Char);
assert(utf8Char == 0xd7a9);

Is there a library (preferably boost) that implements something similar to uni2utf8?

dda
Ezra
  • For the new c++11 unicode string literals see http://stackoverflow.com/questions/6796157/unicode-encoding-for-string-literals-in-c11 – Martin Beckett Jul 22 '12 at 19:50
  • 2
    What you're asking for does not make sense and cannot work. There is no such thing as a UTF-8 character. There are UTF-8 *code units*, which are 8-bit values that when properly decoded form a Unicode codepoint. But UTF-8 code units are *not* stored in `unsigned int`s of 32-bits in size. Each code unit is 8 bits in size; therefore, the way to store a Unicode codepoint in UTF-8 is as a sequence of code units. A *string*, not an integer. – Nicol Bolas Jul 22 '12 at 20:16
  • 1. UTF8 is unicode 2. use nowide. – Pavel Radzivilovsky Jul 23 '12 at 20:56
  • utf8 is not Unicode, utf8 is a method for representing numbers. unicode on the other hand is a mapping between symbols to numbers. Abstract numbers, not their representation. – Ezra Jan 28 '14 at 20:29

4 Answers


Unicode conversions are part of C++11:

#include <codecvt>
#include <locale>
#include <string>
#include <cassert>

int main() {
  std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> convert;
  std::string utf8 = convert.to_bytes(0x5e9);
  assert(utf8.length() == 2);
  assert(utf8[0] == '\xD7');
  assert(utf8[1] == '\xA9');
}
Philipp
  • is there a boost equivalent? (for those who can't code c++11) – Ezra Jul 22 '12 at 20:05
  • @Ezra Yes, there is Boost.Locale, I've added another answer for that. – Philipp Jul 22 '12 at 20:18
  • 3
    You don't need codecvt_utf8. `codecvt` converts between UTF-32 and UTF-8, and `codecvt` converts between UTF-16 and UTF-8. – bames53 Jul 22 '12 at 23:50
  • @bames53: I strongly suspect that works only if `char` is natively UTF-8. E.g. Linux, but not Windows. – MSalters Jul 23 '12 at 07:58
  • @MSalters No, the standard mandates that those codecvt specialzations must use UTF-8 as the 'external' encoding and UTF-32/UTF-16 as the 'internal' encodings. – bames53 Jul 23 '12 at 08:22
  • 1
    @bames53 Three reasons to prefer `codecvt_utf8` (at least in conjunction with `wstring_convert`): 1. It contains the word `utf8`, so it's clearer to the reader what's happening. 2. It's shorter (fewer template arguments required). 3. `codecvt` has a protected destructor and is therefore not usable as a drop-in replacement for `codecvt_utf8`. If you're using `wstring_convert`, you need C++11 anyway, so so always have `codecvt_utf8` at your disposal. I don't see much value in using `codecvt` here. – Philipp Jul 23 '12 at 18:45

Boost.Locale also has functions for encoding conversions:

#include <boost/locale.hpp>
#include <cassert>

int main() {
  unsigned int point = 0x5e9;
  std::string utf8 = boost::locale::conv::utf_to_utf<char>(&point, &point + 1);
  assert(utf8.length() == 2);
  assert(utf8[0] == '\xD7');
  assert(utf8[1] == '\xA9');
}
Philipp

You might want to give the UTF8-CPP library a try. Encoding a Unicode character with it would look like this:

std::wstring unicodeChar(L"\u05e9");
std::string utf8Char;
encode_utf8(unicodeChar, utf8Char);

std::string is used here just as a container for UTF-8 bytes.

Desmond Hume
  • 1
    Doesn't this assume that your `unicodeChar` is encoded in UTF-32? As far as I know, "wide strings" in C and C++ have an unspecified, opaque "system encoding" that could be anything. You'd first need to convert your wide string to UTF-32 using something like `iconv`. – Kerrek SB Jul 22 '12 at 21:49
  • @KerrekSB Do you see me using raw C wide strings alone or in conjunction with platform-specific implementation of `std::wstring`? – Desmond Hume Jul 22 '12 at 21:54
  • @KerrekSB Did I forget to "cook" that raw wide string with `std::wstring`, which knows full well how such strings should be handled on the current platform/compiler? – Desmond Hume Jul 22 '12 at 22:08
  • What do you think `wstring` is? It's just a container of `wchar_t`s, and you initialize those from a bog-standard wide string literal. Where's the "cooking"? – Kerrek SB Jul 22 '12 at 22:26
  • This code indeed won't work on Windows, where `wchar_t` is UCS-2/UTF-16 (16 bits, at least) and therefore cannot convert `U+10000` to UTF-8 – MSalters Jul 23 '12 at 08:01
  • @Msalters I searched the UTF8-CPP library website for a function `encode_utf8()` but didn't find it mentioned anywhere, so I'm not sure what it's specified to do. However, if it's taking a wstring then on Windows it should be expecting UTF-16 input. If so `encode_utf8(L"\U00010000",utf8Char)` would work just fine for converting U+10000 to UTF-8. – bames53 Jul 23 '12 at 19:18

Use sprintf. (:

char cstring[256];
sprintf(cstring, "%S", unicodestring);

iDomo