I have to handle a file format (both reading and writing) in which strings are encoded in UTF-16 (2 bytes per code unit). Since characters outside the ASCII range are rarely used in the application domain, all of the strings in my C++ model classes are stored as std::string instances (UTF-8 encoded).

I'm looking for a library (I searched the STL and Boost with no luck) or a set of C/C++ functions to handle this std::string <-> UTF-16 conversion when loading from or saving to the file format (actually modeled as a bytestream), including the generation and recognition of surrogate pairs and all the other Unicode details, which I'm admittedly no expert in.

Any suggestions? Thanks!

EDIT: forgot to mention it should be cross-platform (Win / Mac) and cannot use C++11.

Peter
  • Ah, I looked into ICU but it seems too over-sized for my task. – Peter Jun 18 '12 at 15:37
  • If you are _only_ targeting windows, use [WideCharToMultiByte](http://msdn.microsoft.com/en-us/library/windows/desktop/dd374130(v=vs.85).aspx), in all other cases, use [ICU](http://site.icu-project.org/). It _can_ be done yourself, but shouldn't be. – Mooing Duck Jun 18 '12 at 15:41
  • This has been asked plenty of times, the one I'm most familiar with is http://stackoverflow.com/questions/148403/utf8-to-from-wide-char-conversion-in-stl – Mark Ransom Jun 18 '12 at 15:45
  • Oh hey, [boost has unicode iterators!](http://www.boost.org/doc/libs/1_50_0_beta1/libs/regex/doc/html/boost_regex/ref/internal_details/uni_iter.html) – Mooing Duck Jun 18 '12 at 15:51

3 Answers

C++11 has this functionality:

std::string s = u8"Hello, World!";

// #include <locale>   (declares std::wstring_convert and std::codecvt)
std::wstring_convert<std::codecvt<char16_t,char,std::mbstate_t>,char16_t> convert;

std::u16string u16 = convert.from_bytes(s);
std::string u8 = convert.to_bytes(u16);

However, to my knowledge the only implementation that has this so far is libc++. C++11 also has std::codecvt_utf8_utf16<char16_t>, which some other implementations do provide. Specifically, codecvt_utf8_utf16 works in VS 2010 and above, and since Windows uses wchar_t to represent UTF-16, you can use it to convert between UTF-8 and Windows' native encoding.
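As a rough illustration of the codecvt_utf8_utf16 route, here is a minimal sketch. The helper names utf8_to_utf16 and utf16_to_utf8 are my own, not part of any library, and it assumes a standard library that ships <codecvt>. Note that these facets were deprecated in C++17, so newer compilers may warn.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper (name chosen for illustration): UTF-8 -> UTF-16.
// Code points above U+FFFF come out as surrogate pairs.
inline std::u16string utf8_to_utf16(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.from_bytes(utf8);  // throws std::range_error on malformed input
}

// Hypothetical helper: UTF-16 -> UTF-8.
inline std::string utf16_to_utf8(const std::u16string& utf16) {
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> conv;
    return conv.to_bytes(utf16);
}
```

Since std::codecvt_utf8_utf16 has a public destructor, it can be handed to wstring_convert directly, unlike the plain std::codecvt specializations.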


The specialization codecvt<char16_t, char, mbstate_t> converts between the UTF-16 and UTF-8 encoding schemes, and the specialization codecvt<char32_t, char, mbstate_t> converts between the UTF-32 and UTF-8 encoding schemes.

(from [locale.codecvt] 22.4.1.4/3)


Oh, and std::codecvt specializations have protected destructors, and wstring_convert requires access to the destructor so you really need an adapter:

template <class Facet>
class usable_facet : public Facet {
public:
    using Facet::Facet; // inherit constructors
    ~usable_facet() {}

    // workaround for compilers without inheriting constructors:
    // template <class ...Args> usable_facet(Args&& ...args) : Facet(std::forward<Args>(args)...) {}
};

template<typename internT, typename externT, typename stateT> 
using codecvt = usable_facet<std::codecvt<internT, externT, stateT>>;

// The second argument must match the facet's internal type; it defaults to wchar_t.
std::wstring_convert<codecvt<char16_t,char,std::mbstate_t>,char16_t> convert;
bames53
  • Whoa, what!? +1 I've never seen this :) – Felix Dombek Jun 18 '12 at 15:51
  • Hmm, made a test case to test extended plane characters, but IDEOne won't compile it: http://ideone.com/UdZcL – Mooing Duck Jun 18 '12 at 15:58
  • @MooingDuck unfortunately libstdc++ still hasn't implemented these specializations, even as of gcc 4.7. – bames53 Jun 18 '12 at 16:00
  • How do you manage utf-16 BE or utf-16 LE and are able to switch between them (while writing to file)? – Sandburg Nov 18 '19 at 11:25
  • @Sandburg One of the template parameters of the UTF codecvt facets is a codecvt_mode that allows you to specify options like endianness and BOMs. https://en.cppreference.com/w/cpp/locale/codecvt_mode – bames53 Jun 05 '20 at 07:48
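To make the codecvt_mode remark concrete, here is a small sketch of picking the byte order when producing a UTF-16 byte stream from UTF-8. The function name utf8_to_utf16_bytes is hypothetical, and going through UTF-32 as an intermediate step is my own choice for the sketch rather than anything prescribed above. It relies on the same deprecated-in-C++17 facets.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: UTF-8 text -> raw UTF-16 bytes in the byte order
// selected by Mode (big-endian when no mode flag is given).
template <std::codecvt_mode Mode = static_cast<std::codecvt_mode>(0)>
std::string utf8_to_utf16_bytes(const std::string& utf8) {
    // Step 1: decode UTF-8 into UTF-32 code points.
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> decode;
    std::u32string code_points = decode.from_bytes(utf8);

    // Step 2: encode the code points as a UTF-16 byte sequence.
    // With char32_t as the element type, surrogate pairs are generated
    // for code points above U+FFFF.
    std::wstring_convert<std::codecvt_utf16<char32_t, 0x10ffff, Mode>,
                         char32_t> encode;
    return encode.to_bytes(code_points);
}
```

Calling utf8_to_utf16_bytes<std::little_endian>(s) yields little-endian bytes and utf8_to_utf16_bytes<>(s) yields big-endian bytes. Combining the mode with std::generate_header should additionally emit a BOM.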

Did you look at Boost.Locale? This page, in particular, describes how to do UTF to UTF conversions and how to integrate it with IOStreams.

thehouse

I would suggest having a look at:

Convert C++ std::string to UTF-16-LE encoded string

Also check out the iconv function. It's part of a C library, so it does not require C++11.

There's also a Win32-specific iconv library at https://github.com/win-iconv/win-iconv.
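For reference, a minimal sketch of how the iconv route might look from C++03. The wrapper name iconv_convert is made up for this example. It assumes the encoding names "UTF-8" and "UTF-16LE" are accepted by your iconv implementation, and that iconv takes a char** input pointer as glibc declares it (some platforms declare const char** instead).

```cpp
#include <iconv.h>
#include <cerrno>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical wrapper: converts `input` from encoding `from_code`
// to encoding `to_code` and returns the resulting bytes.
inline std::string iconv_convert(const std::string& input,
                                 const char* to_code,
                                 const char* from_code) {
    iconv_t cd = iconv_open(to_code, from_code);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed");

    std::string output;
    std::vector<char> buffer(256);
    // iconv does not modify the input bytes; the cast only satisfies
    // the char** signature.
    char* in_ptr = const_cast<char*>(input.data());
    std::size_t in_left = input.size();

    while (in_left > 0) {
        char* out_ptr = &buffer[0];
        std::size_t out_left = buffer.size();
        std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        int err = errno;  // save before any other call can change it
        output.append(&buffer[0], buffer.size() - out_left);
        // E2BIG only means the output buffer filled up, so loop again.
        if (rc == (std::size_t)-1 && err != E2BIG) {
            iconv_close(cd);
            throw std::runtime_error("iconv failed on invalid or truncated input");
        }
    }
    iconv_close(cd);
    return output;
}
```

A call such as iconv_convert(text, "UTF-16LE", "UTF-8") would produce the bytes to write to the file, and swapping the two encoding names converts in the other direction when reading.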

JYG