Is it possible to convert a UTF-8 string held in a std::string to a std::wstring, and vice versa, in a platform-independent manner? In a Windows application I would use MultiByteToWideChar and WideCharToMultiByte. However, the code is compiled for multiple OSes and I'm limited to the standard C++ library.
-
Incidentally, the standard C++ library is not called STL; the STL is just a small subsection of the standard C++ library. In this case, I believe you are asking for functionality in the standard C++ library, and I've answered accordingly. – C. K. Young Sep 29 '08 at 12:09
-
You haven't specified which encoding you want to end up with. wstring doesn't specify any particular encoding. Of course it'd be natural to convert to UTF-32 on platforms where wchar_t is 4 bytes wide, and UTF-16 if wchar_t is 2 bytes. Is that what you want? – jalf Nov 11 '08 at 15:31
-
@jalf Your comment is misleading. `std::wstring` is `std::basic_string<wchar_t>`. `wchar_t` is an opaque data type that represents a Unicode character (the fact that on Windows it is 16 bits long only means that Windows does not follow the standard). There is no “encoding” for abstract Unicode characters, they are just characters. – kirelagin Mar 12 '20 at 20:35
-
[UTF8-CPP: UTF-8 with C++ in a Portable Way](https://github.com/nemtrif/utfcpp) – Assaf Lavie Sep 29 '08 at 14:42
7 Answers
I asked this question 5 years ago. This thread was very helpful for me back then; I came to a conclusion, then moved on with my project. It is funny that I needed something similar recently, totally unrelated to that project from the past. As I was researching possible solutions, I stumbled upon my own question :)
The solution I chose now is based on C++11. The Boost facilities that Constantin mentions in his answer are now part of the standard. If we replace std::wstring with the new string type std::u16string, then the conversions will look like this:
UTF-8 to UTF-16
std::string source;
...
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert;
std::u16string dest = convert.from_bytes(source);
UTF-16 to UTF-8
std::u16string source;
...
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>,char16_t> convert;
std::string dest = convert.to_bytes(source);
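For completeness, here is a self-contained sketch of the same conversions with the required headers. Note that <codecvt> and std::wstring_convert are deprecated since C++17 (see the comments below); the sample text is just illustrative.
#include <codecvt>
#include <locale>
#include <string>

int main()
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;

    std::string utf8 = "z\xc3\x9f\xe6\xb0\xb4";       // "zß水" as raw UTF-8 bytes
    std::u16string utf16 = convert.from_bytes(utf8);  // UTF-8  -> UTF-16
    std::string back = convert.to_bytes(utf16);       // UTF-16 -> UTF-8

    return back == utf8 ? 0 : 1;                      // the round trip should be lossless
}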
As seen from the other answers, there are multiple approaches to the problem. That's why I refrain from picking an accepted answer.

-
wstring implies 2- or 4-byte characters instead of single-byte ones. Where does the question ask to switch from UTF-8 encoding? – Chawathe Vipul S Apr 25 '13 at 09:14
-
I've got some strange poor performance with codecvt, look here for details: http://stackoverflow.com/questions/26196686/utf8-utf16-codecvt-poor-performance – Xtra Coder Oct 04 '14 at 20:06
-
I think you should accept this answer. Sure there are multiple ways to solve this, but this is the only portable solution that does not need a library. – Navin Jun 16 '15 at 20:48
-
@HojjatJafary None :). `codecvt_utf8_utf16` is deprecated too, by the way (and, no, there is no replacement either). – kirelagin Mar 12 '20 at 20:30
-
The problem definition explicitly states that the 8-bit character encoding is UTF-8. That makes this a trivial problem; all it requires is a little bit-twiddling to convert from one UTF spec to another.
Just look at the encodings on these Wikipedia pages for UTF-8, UTF-16, and UTF-32.
The principle is simple - go through the input and assemble a 32-bit Unicode code point according to one UTF spec, then emit the code point according to the other spec. The individual code points need no translation, as would be required with any other character encoding; that's what makes this a simple problem.
Here's a quick implementation of wchar_t to UTF-8 conversion and vice versa. It assumes that the input is already properly encoded - the old saying "Garbage in, garbage out" applies here. I believe that verifying the encoding is best done as a separate step.
std::string wchar_to_UTF8(const wchar_t * in)
{
    std::string out;
    unsigned int codepoint = 0;
    for (; *in != 0; ++in)
    {
        if (*in >= 0xd800 && *in <= 0xdbff)
            // high surrogate: keep the top 10 bits, wait for the low surrogate
            codepoint = ((*in - 0xd800) << 10) + 0x10000;
        else
        {
            if (*in >= 0xdc00 && *in <= 0xdfff)
                // low surrogate: merge in the bottom 10 bits
                codepoint |= *in - 0xdc00;
            else
                codepoint = *in;

            // emit the code point as 1 to 4 UTF-8 bytes
            if (codepoint <= 0x7f)
                out.append(1, static_cast<char>(codepoint));
            else if (codepoint <= 0x7ff)
            {
                out.append(1, static_cast<char>(0xc0 | ((codepoint >> 6) & 0x1f)));
                out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
            }
            else if (codepoint <= 0xffff)
            {
                out.append(1, static_cast<char>(0xe0 | ((codepoint >> 12) & 0x0f)));
                out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
                out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
            }
            else
            {
                out.append(1, static_cast<char>(0xf0 | ((codepoint >> 18) & 0x07)));
                out.append(1, static_cast<char>(0x80 | ((codepoint >> 12) & 0x3f)));
                out.append(1, static_cast<char>(0x80 | ((codepoint >> 6) & 0x3f)));
                out.append(1, static_cast<char>(0x80 | (codepoint & 0x3f)));
            }
            codepoint = 0;
        }
    }
    return out;
}
The above code works for both UTF-16 and UTF-32 input, simply because the range 0xd800 through 0xdfff is invalid as code points; those values indicate that you're decoding UTF-16. If you know that wchar_t is 32 bits, you could remove some code to optimize the function.
std::wstring UTF8_to_wchar(const char * in)
{
    std::wstring out;
    unsigned int codepoint = 0;
    while (*in != 0)
    {
        unsigned char ch = static_cast<unsigned char>(*in);
        if (ch <= 0x7f)
            codepoint = ch;
        else if (ch <= 0xbf)
            // continuation byte: shift in 6 more bits
            codepoint = (codepoint << 6) | (ch & 0x3f);
        else if (ch <= 0xdf)
            codepoint = ch & 0x1f;   // start of a 2-byte sequence
        else if (ch <= 0xef)
            codepoint = ch & 0x0f;   // start of a 3-byte sequence
        else
            codepoint = ch & 0x07;   // start of a 4-byte sequence
        ++in;
        // emit when the next byte is not a continuation byte
        if (((*in & 0xc0) != 0x80) && (codepoint <= 0x10ffff))
        {
            if (sizeof(wchar_t) > 2)
                out.append(1, static_cast<wchar_t>(codepoint));
            else if (codepoint > 0xffff)
            {
                // 16-bit wchar_t: encode as a UTF-16 surrogate pair
                codepoint -= 0x10000;
                out.append(1, static_cast<wchar_t>(0xd800 + (codepoint >> 10)));
                out.append(1, static_cast<wchar_t>(0xdc00 + (codepoint & 0x03ff)));
            }
            else if (codepoint < 0xd800 || codepoint >= 0xe000)
                out.append(1, static_cast<wchar_t>(codepoint));
        }
    }
    return out;
}
Again, if you know that wchar_t is 32 bits you could remove some code from this function, but in this case it shouldn't make any difference. The expression sizeof(wchar_t) > 2 is known at compile time, so any decent compiler will recognize the dead code and remove it.
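As a quick sanity check, here is a minimal round trip through both functions. It is only a sketch; it assumes the two functions above are in scope and that the compiler accepts the non-BMP character via a universal character name.
#include <cassert>
#include <string>

int main()
{
    // ASCII, a 2-byte, a 3-byte and a 4-byte UTF-8 case (the last one is a non-BMP code point)
    const wchar_t* original = L"A\u00e9\u4e3b\U0001F600";

    std::string utf8 = wchar_to_UTF8(original);       // wide -> UTF-8
    std::wstring back = UTF8_to_wchar(utf8.c_str());  // UTF-8 -> wide

    assert(back == original);                         // lossless for valid input
    return 0;
}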

-
I don't see that he said anything about std::string containing UTF-8 encoded strings in the original question: "Is it possible to convert std::string to std::wstring and vice versa in a platform independent manner?" – Nemanja Trifunovic Sep 29 '08 at 16:59
-
UTF-8 is specified in the title of the post. You are correct that it is missing from the body of the text. – Mark Ransom Sep 29 '08 at 18:07
-
Thank you for the correction, I did intend to use UTF8. I edited the question to be more clear. – Vladimir Grigorov Sep 30 '08 at 08:55
-
What you've got may be a good "proof of concept". It's one thing to convert valid encodings successfully. It is another level of effort to handle conversion of invalid encoding data (e.g. unpaired surrogates in UTF-16) correctly according to the specifications. For that you really need some more thoroughly designed and tested code. – Craig McQueen Jul 23 '11 at 23:56
-
@Craig McQueen, you're absolutely right. I made the assumption that the encoding was already correct, and it was just a mechanical conversion. I'm sure there are situations where that's the case, and this code would be adequate - but the limitations should be stated explicitly. It's not clear from the original question if this should be a concern or not. – Mark Ransom Jul 24 '11 at 01:00
-
I have the same feeling as you. The question already states "UTF8", so it is an encoding/decoding issue. It has nothing to do with locale. Those whose answers mention locale didn't get the point at all. – Tyler Liu Mar 23 '13 at 16:35
-
@moogs after all these years I just realized how close this was to working for both UTF-16 and UTF-32 `wchar_t`. I've updated the answer. – Mark Ransom Sep 29 '17 at 04:39
-
What do you mean by “It assumes that the input is already properly encoded”? Your input is made of `wchar_t`, which is an opaque data type that represents a Unicode character, you are not allowed to make any assumptions about its internal representation, the only thing you can do is call provided library functions on it, like `wctomb`, which will encode the character using current system locale encoding. – kirelagin Mar 12 '20 at 20:23
-
@kirelagin it's my code, I'm allowed to make any assumptions I want. I added that statement to make it clear that there wasn't any error checking in the code, and if you fed it invalid input I couldn't guarantee the correctness of the result. By "invalid input" I mean for example a code point greater than 0x10ffff. – Mark Ransom Mar 12 '20 at 21:00
-
@MarkRansom The author of the question is asking for “a platform independent manner”. Your manner is not only not platform independent, it is not even independent of the standard library implementation on a single platform. You can not make any assumptions about the numbers in the variable of type `wchar_t`, any error checking you can add will be incorrect and will be triggered by valid inputs on some platforms/implementations (potentially). – kirelagin Mar 13 '20 at 20:00
-
@kirelagin which is exactly why I suggest in the answer that conversion and validation should be two separate operations. If you want to assert that my code is incorrect you'll have to be more specific about the conditions under which it would be incorrect. The only assumption I make about `wchar_t` is that it holds a range of integers appropriate for the platform on which it is compiled. – Mark Ransom Mar 13 '20 at 20:32
-
@MarkRansom I’m sorry, I just got very confused by your statement that “The above code works for both UTF-16 and UTF-32 input”, because the input is `wchar_t`s, which are _abstract_ code points (according to the C standard), and the concept of encoding (UTF-16 or UTF-32) does not apply to them in any meaningful way. I see what you meant now: basically, this code works both with `wchar_t` that represent all of Unicode and platforms like Windows that hack `wchar_t` for their purposes. – kirelagin Mar 14 '20 at 15:13
-
@kirelagin `wchar_t` was *intended* to hold code points, but that's not how it worked out in practice. As a concrete example when Windows first got Unicode all the codepoints fit into 16 bits so `wchar_t` was made a 16-bit integer. Later when Unicode was extended they were forced to use UTF-16 encoding to make it work, and that's what Windows uses to this day with `wchar_t` still 16 bits. Don't look down on Windows, their problems stem from being an early adopter and they aren't alone. – Mark Ransom Mar 14 '20 at 15:53
-
@MarkRansom I found your answer and ported it to the Nim language for my purposes. However, for `UTF8_to_wchar` I found that in the `else if (codepoint > 0xffff)` case I needed to have `(0xd7c0 + (codepoint >> 10))` in place of `(0xd800 + (codepoint >> 10))`. I don't think that has anything to do with Nim, rather I'm wondering if it's a mistake that should be corrected in your answer. Thanks for your work on this, it's very helpful! – michaelsbradleyjr Mar 31 '23 at 16:33
-
@michaelsbradleyjr that's odd, I'm following the specs exactly which clearly say 0xd800, see https://en.wikipedia.org/wiki/UTF-16#Code_points_from_U+010000_to_U+10FFFF . But testing in Python with the smiley face `U+1f600` agrees with your observation. There's a mystery here. – Mark Ransom Mar 31 '23 at 19:39
-
@michaelsbradleyjr mystery solved. I missed the part "0x10000 is subtracted from the code point" so that's definitely a mistake in my code. I'll fix it tonight. – Mark Ransom Mar 31 '23 at 20:51
-
@michaelsbradleyjr fixed. Note that `0x10000>>10` is `0x40`, so mathematically your fix is the same as mine. Mine is easier to reconcile with the official algorithm description though. Amazing how a bug can go uncaught for 14 years. – Mark Ransom Apr 01 '23 at 02:15
-
Thanks, @MarkRansom! I made [corresponding changes](https://github.com/michaelsbradleyjr/nim-notcurses/commit/34e787749b9ec5b846460128ac98b59ed45aed71) in my code, and CI tests passed for Windows, Linux, etc. – michaelsbradleyjr Apr 01 '23 at 03:31
You can extract utf8_codecvt_facet from the Boost Serialization library.
Their usage example:
typedef wchar_t ucs4_t;

std::locale old_locale;
std::locale utf8_locale(old_locale, new utf8_codecvt_facet<ucs4_t>);

// Set a new global locale
std::locale::global(utf8_locale);

// Send the UCS-4 data out, converting to UTF-8
{
    std::wofstream ofs("data.ucd");
    ofs.imbue(utf8_locale);
    std::copy(ucs4_data.begin(), ucs4_data.end(),
              std::ostream_iterator<ucs4_t, ucs4_t>(ofs));
}

// Read the UTF-8 data back in, converting to UCS-4 on the way in
std::vector<ucs4_t> from_file;
{
    std::wifstream ifs("data.ucd");
    ifs.imbue(utf8_locale);
    ucs4_t item = 0;
    while (ifs >> item) from_file.push_back(item);
}
Look for the utf8_codecvt_facet.hpp and utf8_codecvt_facet.cpp files in the Boost sources.
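If you need an in-memory conversion rather than file I/O, the same facet can be driven directly through the std::codecvt interface it implements. The sketch below is an untested illustration: it assumes utf8_codecvt_facet<wchar_t> is the Boost facet described above and that it registers in the locale under std::codecvt<wchar_t, char, std::mbstate_t>, and the function name is made up for the example.
#include <locale>
#include <string>

std::wstring utf8_to_wstring_via_facet(const std::string& utf8)
{
    // install the facet into a locale so its lifetime is managed for us
    std::locale utf8_locale(std::locale(), new utf8_codecvt_facet<wchar_t>);
    const std::codecvt<wchar_t, char, std::mbstate_t>& cvt =
        std::use_facet<std::codecvt<wchar_t, char, std::mbstate_t>>(utf8_locale);

    std::wstring out(utf8.size(), L'\0');   // decoded text is never longer than its UTF-8 form
    std::mbstate_t state{};
    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;

    cvt.in(state,
           utf8.data(), utf8.data() + utf8.size(), from_next,
           &out[0], &out[0] + out.size(), to_next);

    out.resize(static_cast<std::size_t>(to_next - &out[0]));
    return out;
}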

-
I thought you had to imbue the stream before it is opened, otherwise the imbue is ignored! – Martin York Nov 11 '08 at 05:33
-
Martin, it seems to work with Visual Studio 2005: 0x41a is successfully converted to {0xd0, 0x9a} UTF-8 sequence. – Constantin Nov 11 '08 at 15:15
There are several ways to do this, but the results depend on what the character encodings are in the string and wstring variables.
If you know the string is ASCII, you can simply use wstring's iterator constructor:
string s = "This is surely ASCII.";
wstring w(s.begin(), s.end());
If your string has some other encoding, however, you'll get very bad results. If the encoding is Unicode, you could take a look at the ICU project, which provides a cross-platform set of libraries that convert to and from all sorts of Unicode encodings.
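For the ICU route just mentioned, here is a minimal sketch. It assumes ICU4C is installed and linked (e.g. with -licuuc); note that ICU stores text as UTF-16 internally, so getting a std::wstring out takes an extra step that depends on the width of wchar_t on your platform.
#include <unicode/unistr.h>
#include <iostream>
#include <string>

int main()
{
    std::string utf8 = "\xe4\xb8\xbb\xe4\xbd\x93";              // some non-ASCII UTF-8 text
    icu::UnicodeString u = icu::UnicodeString::fromUTF8(utf8);  // decode UTF-8 into ICU's UTF-16
    std::string back;
    u.toUTF8String(back);                                       // encode back to UTF-8
    std::cout << (back == utf8 ? "round trip ok" : "mismatch") << "\n";
}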
If your string contains characters in a code page, then may $DEITY have mercy on your soul.

-
ICU converts to/from every character encoding I have ever come across. It's huge. – Martin York Sep 29 '08 at 16:12
You can use the codecvt locale facet. There's a specific specialisation defined, codecvt<wchar_t, char, mbstate_t>, that may be of use to you, although the behaviour of that one is system-specific and does not guarantee conversion to UTF-8 in any way.
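If you specifically want UTF-8 regardless of the system locale, a sketch along these lines pins the conversion down (it uses std::codecvt_utf8 from C++11, which is deprecated since C++17 but still widely available; the function names are just illustrative):
#include <codecvt>
#include <locale>
#include <string>

std::wstring utf8_to_wstring(const std::string& utf8)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(utf8);   // UTF-8 -> wide
}

std::string wstring_to_utf8(const std::wstring& wide)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(wide);     // wide -> UTF-8
}
Be aware that on platforms where wchar_t is 16 bits, codecvt_utf8 treats the wide string as UCS-2 (no surrogate pairs); codecvt_utf8_utf16<wchar_t> is the variant that handles full UTF-16 there.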

-
Doing encoding/decoding according to locale is a bad idea. Just as you said: "does not guarantee". – Tyler Liu Mar 23 '13 at 16:24
-
@TylerLong obviously one should configure std::locale instance specifically for the required conversion. – Basilevs May 11 '14 at 12:11
-
@Basilevs I still think using locale to encode/decode is wrong. The correct way is to configure `encoding` instead of `locale`. As far as I can tell, there is no such locale which can represent **every** single Unicode character. Let's say I want to encode a string which contains all of the Unicode characters; which locale do you suggest I configure? Correct me if I am wrong. – Tyler Liu Dec 08 '14 at 12:52
-
@TylerLong Locale in C++ is a very abstract concept that covers far more things than just regional settings and encodings. Basically one can do everything with it. While codecvt_facet indeed handles more than just simple recoding, absolutely nothing prevents it from making simple Unicode transformations. – Basilevs Dec 20 '14 at 14:14
I created my own library for UTF-8 to UTF-16/UTF-32 conversion, but decided to make it a fork of an existing project for that purpose.
https://github.com/tapika/cutf
(Originated from https://github.com/noct/cutf )
The API works with plain C as well as with C++.
Function prototypes look like this (for the full list see https://github.com/tapika/cutf/blob/master/cutf.h ):
//
// Converts utf-8 string to wide version.
//
// returns target string length.
//
size_t utf8towchar(const char* s, size_t inSize, wchar_t* out, size_t bufSize);
//
// Converts wide string to utf-8 string.
//
// returns filled buffer length (not string length)
//
size_t wchartoutf8(const wchar_t* s, size_t inSize, char* out, size_t outsize);
#ifdef __cplusplus
std::wstring utf8towide(const char* s);
std::wstring utf8towide(const std::string& s);
std::string widetoutf8(const wchar_t* ws);
std::string widetoutf8(const std::wstring& ws);
#endif
Sample usage / simple test application for utf conversion testing:
#include "cutf.h"
#define ok(statement) \
if( !(statement) ) \
{ \
printf("Failed statement: %s\n", #statement); \
r = 1; \
}
int simpleStringTest()
{
const wchar_t* chineseText = L"主体";
auto s = widetoutf8(chineseText);
size_t r = 0;
printf("simple string test: ");
ok( s.length() == 6 );
uint8_t utf8_array[] = { 0xE4, 0xB8, 0xBB, 0xE4, 0xBD, 0x93 };
for(int i = 0; i < 6; i++)
ok(((uint8_t)s[i]) == utf8_array[i]);
auto ws = utf8towide(s);
ok(ws.length() == 2);
ok(ws == chineseText);
if( r == 0 )
printf("ok.\n");
return (int)r;
}
And if this library does not satisfy your needs, feel free to open the following link, scroll down to the end of the page, and pick any heavier library that you like.

I don't think there's a portable way of doing this. C++ doesn't know the encoding of its multibyte characters.
As Chris suggested, your best bet is to play with codecvt.

-
The question says "UTF8", so "the encoding of its multibyte characters" is known. – Tyler Liu Mar 23 '13 at 16:26