32

If I have a UTF-8 std::string, how do I convert it to a UTF-16 std::wstring? Actually, I want to compare two Persian words.

aliakbarian
  • See http://stackoverflow.com/questions/148403/utf8-to-from-wide-char-conversion-in-stl among others. – Mark Ransom Aug 22 '11 at 21:44
  • possible duplicate of [how can I compare utf8 string such as persian words in c++?](http://stackoverflow.com/questions/7141417/how-can-i-compare-utf8-string-such-as-persian-words-in-c) or [this](http://stackoverflow.com/questions/7141260/compare-stdwstring-and-stdstring). – Kerrek SB Aug 22 '11 at 21:47

6 Answers

53

This is how you do it with C++11:

std::string str = "your string in utf8";
std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;
std::u16string u16str = converter.from_bytes(str);

On Windows, where wchar_t is 16 bits wide, you can use wchar_t in place of char16_t and get a std::wstring directly. And these are the headers you need:

#include <string>
#include <locale>
#include <codecvt>

A more complete example is available here: http://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes
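
Since the question is ultimately about comparing two Persian words, here is a minimal sketch built on the same converter (the sample words are illustrative, not from the question):

#include <string>
#include <locale>
#include <codecvt>

int main()
{
    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> converter;

    // Two Persian words as UTF-8 literals (illustrative only)
    std::u16string a = converter.from_bytes(u8"سلام");
    std::u16string b = converter.from_bytes(u8"کتاب");

    // Code-unit-wise comparison of the UTF-16 strings
    bool equal = (a == b); // false: different words
    return equal ? 0 : 1;
}

Note that operator== compares code units, so two canonically equivalent but differently composed strings would need Unicode normalization before comparing.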

Yuchen
  • Great answer, thanks! ...but do follow the example at cppreference.com. `wchar_t` is not a 16-bit type on operating systems other than Windows. You need to use `char16_t` instead. – Cris Luengo Mar 26 '17 at 18:30
  • @CrisLuengo thanks! I updated the answer to use `char16_t` instead. – Yuchen Mar 27 '17 at 12:24
  • Not working with g++ 6.2 or clang++ 3.8 on Lubuntu 16.04. – May 08 '17 at 17:55
  • Unfortunately, this was deprecated in C++17. https://mariusbancila.ro/blog/2018/07/05/c17-removed-and-deprecated-features/ – Andrey Belykh Sep 11 '19 at 18:53
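
Given that deprecation, a minimal Windows-only sketch of the usual replacement, the Win32 MultiByteToWideChar API, might look like this (the helper name utf8_to_wide is made up for illustration):

#include <windows.h>
#include <stdexcept>
#include <string>

std::wstring utf8_to_wide(const std::string& utf8)
{
    if (utf8.empty())
        return std::wstring();

    // First call: compute the required length in wide characters.
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), (int)utf8.size(), nullptr, 0);
    if (len == 0)
        throw std::runtime_error("invalid UTF-8");

    // Second call: perform the actual conversion.
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), (int)utf8.size(), &wide[0], len);
    return wide;
}

On other platforms a library such as ICU or Boost.Nowide is the usual substitute.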
31

Here's some code. It's only lightly tested, and there are probably a few improvements to be made. Call this function to convert a UTF-8 string to a UTF-16 wstring. If the input is not valid UTF-8 it throws an exception; otherwise it returns the equivalent UTF-16 wstring.

#include <string>
#include <vector>
#include <stdexcept>

std::wstring utf8_to_utf16(const std::string& utf8)
{
    std::vector<unsigned long> unicode;
    size_t i = 0;
    // First pass: decode the UTF-8 byte sequence into Unicode code points.
    while (i < utf8.size())
    {
        unsigned long uni;
        size_t todo;          // number of continuation bytes to read
        unsigned char ch = utf8[i++];
        if (ch <= 0x7F)       // 0xxxxxxx: ASCII, one byte
        {
            uni = ch;
            todo = 0;
        }
        else if (ch <= 0xBF)  // 10xxxxxx: stray continuation byte
        {
            throw std::logic_error("not a UTF-8 string");
        }
        else if (ch <= 0xDF)  // 110xxxxx: two-byte sequence
        {
            uni = ch & 0x1F;
            todo = 1;
        }
        else if (ch <= 0xEF)  // 1110xxxx: three-byte sequence
        {
            uni = ch & 0x0F;
            todo = 2;
        }
        else if (ch <= 0xF7)  // 11110xxx: four-byte sequence
        {
            uni = ch & 0x07;
            todo = 3;
        }
        else
        {
            throw std::logic_error("not a UTF-8 string");
        }
        for (size_t j = 0; j < todo; ++j)
        {
            if (i == utf8.size())
                throw std::logic_error("not a UTF-8 string");
            unsigned char ch = utf8[i++];
            if (ch < 0x80 || ch > 0xBF)   // must be a 10xxxxxx continuation byte
                throw std::logic_error("not a UTF-8 string");
            uni <<= 6;
            uni += ch & 0x3F;
        }
        if (uni >= 0xD800 && uni <= 0xDFFF)   // surrogate halves are not valid code points
            throw std::logic_error("not a UTF-8 string");
        if (uni > 0x10FFFF)                   // beyond the Unicode range
            throw std::logic_error("not a UTF-8 string");
        unicode.push_back(uni);
    }
    // Second pass: encode the code points as UTF-16.
    std::wstring utf16;
    for (size_t i = 0; i < unicode.size(); ++i)
    {
        unsigned long uni = unicode[i];
        if (uni <= 0xFFFF)
        {
            utf16 += (wchar_t)uni;   // fits in a single UTF-16 code unit
        }
        else
        {
            // Code points above U+FFFF are encoded as a surrogate pair.
            uni -= 0x10000;
            utf16 += (wchar_t)((uni >> 10) + 0xD800);
            utf16 += (wchar_t)((uni & 0x3FF) + 0xDC00);
        }
    }
    return utf16;
}
john
  • Thank you! Thank you! It worked... I can't believe it :) Thank you for your time, john. – aliakbarian Aug 22 '11 at 22:23
  • Really glad it helped. It really is just a matter of asking the right question. There's a lot of knowledge on this forum, but newbies often can't access that knowledge because they don't know what to ask. – john Aug 22 '11 at 22:30
  • @aliakbarian: I've actually just spotted a minor bug in my code, you probably should copy it again. I changed this `if (j == utf8.size())` to this `if (i == utf8.size())`. – john Aug 22 '11 at 22:39
  • Note: this is Windows-only; Unix systems use a 32-bit wchar_t. Although you can still do `std::wstring wstr(str.begin(), str.end());` on Windows. – Simon Nitzsche Nov 22 '19 at 16:49
  • @coo Sure, that's possible. If your goal is to trash your data. Simply widening every UTF-8 code unit to fit into a UTF-16 code unit does not magically convert between those encodings. This will just produce gibberish for any code unit in the input sequence that doesn't happen to encode an ASCII code point. – IInspectable Jun 28 '20 at 16:37
  • Great job, thanks! I've used it to convert an FPC string (via @str[1]) to a C++ wstring. – roberto Feb 19 '22 at 10:18
  • This code allows, for example, invalid overlong encodings through, and the intermediate copy means gratuitously copying each code point. I think it's better to use a carefully developed and tested algorithm. I would recommend the implementation in the [Boost.Nowide library](https://github.com/boostorg/nowide) - it has a freestanding version so you can use it independently of the rest of Boost. – AndyK Mar 07 '22 at 21:18
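
Following up on that recommendation, a minimal sketch using Boost.Nowide (assuming its convert.hpp header; the Persian sample word is illustrative):

#include <boost/nowide/convert.hpp>
#include <string>

int main()
{
    std::string utf8 = u8"سلام"; // a UTF-8 encoded Persian word

    // widen() converts UTF-8 to std::wstring: UTF-16 where wchar_t is
    // 16 bits (Windows), UTF-32 where it is 32 bits (most Unix systems).
    std::wstring wide = boost::nowide::widen(utf8);

    // narrow() converts back to UTF-8.
    std::string roundtrip = boost::nowide::narrow(wide);
    return roundtrip == utf8 ? 0 : 1;
}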
2

To convert between the two types, you can use std::codecvt_utf8_utf16<wchar_t> (this assumes a 16-bit wchar_t, as on Windows).
Note the string prefixes used to define UTF-16 (L) and UTF-8 (u8) literals.

#include <string>
#include <locale>
#include <codecvt>
#include <cassert>

int main()
{
    std::string original8 = u8"הלו";
    std::wstring original16 = L"הלו";

    // C++11 format converter
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;

    // convert to UTF-8 and std::string
    std::string utf8NativeString = convert.to_bytes(original16);

    // convert to UTF-16 and std::wstring
    std::wstring utf16NativeString = convert.from_bytes(original8);

    assert(utf8NativeString == original8);
    assert(utf16NativeString == original16);

    return 0;
}
Yochai Timmer
2

There are some relevant Q&As here and here which are worth a read.

Basically you need to convert the string to a common format -- my preference is always to convert to UTF-8, but your mileage may vary.

A lot of software has been written for doing the conversion -- it is straightforward and can be written in a few hours -- however, why not pick up something already done, such as the UTF-8 CPP library?
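
If you go that route, a minimal sketch with UTF-8 CPP's utf8::utf8to16 (the Persian sample word is illustrative):

#include <string>
#include <iterator>
#include "utf8.h" // from the UTF-8 CPP (utfcpp) library

int main()
{
    std::string utf8 = u8"سلام"; // a UTF-8 encoded Persian word
    std::u16string utf16;

    // Decode the UTF-8 sequence and append UTF-16 code units.
    utf8::utf8to16(utf8.begin(), utf8.end(), std::back_inserter(utf16));
    return 0;
}

The checked functions throw utf8::invalid_utf8 on malformed input; unchecked variants live in the utf8::unchecked namespace for trusted input.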

Soren
  • If you're Windows only: http://msdn.microsoft.com/en-us/library/dd319072(v=VS.85).aspx. Otherwise, use a portable library. – Mooing Duck Aug 22 '11 at 22:20
0

Microsoft has developed a library for such conversions as part of their Casablanca project, also known as the C++ REST SDK (cpprestsdk). The functions live in the utility::conversions namespace.

A simple usage, with `using namespace utility::conversions` in effect, looks something like this:

utf8_to_utf16("sample_string");
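
A fuller, self-contained sketch (assuming <cpprest/asyncrt_utils.h> is the header that declares utility::conversions):

#include <cpprest/asyncrt_utils.h> // assumed header for utility::conversions
#include <string>

int main()
{
    // utf8_to_utf16 returns the SDK's UTF-16 string type.
    auto utf16 = utility::conversions::utf8_to_utf16("sample_string");
    return utf16.empty() ? 1 : 0;
}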
Srijan Chaudhary
-1

This page also seems useful: http://www.codeproject.com/KB/string/UtfConverter.aspx

In the comment section of that page, there are also some interesting suggestions for this task, like:

// Get an ASCII std::string from anywhere
std::string sLogLevelA = "Hello ASCII-world!";

std::wstringstream ws;
ws << sLogLevelA.c_str();
std::wstring sLogLevel = ws.str();

Or

// To std::string:
str.assign(ws.begin(), ws.end());
// To std::wstring
ws.assign(str.begin(), str.end());

Be aware, though, that these approaches simply widen each byte one-for-one, so they are only valid for plain ASCII input; they will corrupt any multi-byte UTF-8 sequence.

jj1