How to turn std::string that contains utf-16 encoded text in it into utf-16 wstring?

Question

So we get a string like РќРѕРІР°СЏ РїР°РїРєР° which is utf-8 representation of utf-16 encoded line (Новая папка in utf-16) we want to turn this string into wstring not changing encoding.. meaning literally bring all data from string to wstring with out any conversion. So we would get wstring with Новая папка contents. How to do such thing?

Update: What I meant to say - we have all data for correct utf-16 string inside of string. All we need is to put that data into wstring... that means if wstring contains of wchar which could happen to be 0000 we would have to put 2 string chars 00 and 00 together to get it. That is what I do not know how to do.

Update2 How I got here - a C++ lib I am obligated to use on my server is C style parser. and it returns me user request adress as std::string. while I make my clients send to me requests in such format.

url_encode(UTF16toUTF8(wstring)) //pseudocode.

where

string UTF16toUTF8(const wstring & in)
{
    string out;
    unsigned int codepoint;
    bool completecode = false;
    for (wstring::const_iterator p = in.begin();  p != in.end();  ++p)
    {
        if (*p >= 0xd800 && *p <= 0xdbff)
        {
            codepoint = ((*p - 0xd800) << 10) + 0x10000;
            completecode = false;
        }
        else if (!completecode && *p >= 0xdc00 && *p <= 0xdfff)
        {
            codepoint |= *p - 0xdc00;
            completecode = true;
        }
        else
        {
            codepoint = *p;
            completecode = true;
        }
        if (completecode)
        {
            if (codepoint <= 0x7f)
                out.push_back(codepoint);
            else if (codepoint <= 0x7ff)
            {
                out.push_back(0xc0 | ((codepoint >> 6) & 0x1f));
                out.push_back(0x80 | (codepoint & 0x3f));
            }
            else if (codepoint <= 0xffff)
            {
                out.push_back(0xe0 | ((codepoint >> 12) & 0x0f));
                out.push_back(0x80 | ((codepoint >> 6) & 0x3f));
                out.push_back(0x80 | (codepoint & 0x3f));
            }
            else
            {
                out.push_back(0xf0 | ((codepoint >> 18) & 0x07));
                out.push_back(0x80 | ((codepoint >> 12) & 0x3f));
                out.push_back(0x80 | ((codepoint >> 6) & 0x3f));
                out.push_back(0x80 | (codepoint & 0x3f));
            }
        }
    }
    return out;
}

std::string url_encode( std::string sSrc )
{
    const char SAFE[256] =
    {
        /*      0 1 2 3  4 5 6 7  8 9 A B  C D E F */
        /* 0 */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
        /* 1 */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
        /* 2 */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
        /* 3 */ 1,1,1,1, 1,1,1,1, 1,1,0,0, 0,0,0,0,

        /* 4 */ 0,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1,
        /* 5 */ 1,1,1,1, 1,1,1,1, 1,1,1,0, 0,0,0,0,
        /* 6 */ 0,1,1,1, 1,1,1,1, 1,1,1,1, 1,1,1,1,
        /* 7 */ 1,1,1,1, 1,1,1,1, 1,1,1,0, 0,0,0,0,

        /* 8 */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
        /* 9 */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
        /* A */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
        /* B */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,

        /* C */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
        /* D */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
        /* E */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0,
        /* F */ 0,0,0,0, 0,0,0,0, 0,0,0,0, 0,0,0,0
    };
    const char DEC2HEX[16 + 1] = "0123456789ABCDEF";
    const unsigned char * pSrc = (const unsigned char *)sSrc.c_str();
    const int SRC_LEN = sSrc.length();
    unsigned char * const pStart = new unsigned char[SRC_LEN * 3];
    unsigned char * pEnd = pStart;
    const unsigned char * const SRC_END = pSrc + SRC_LEN;

    for (; pSrc < SRC_END; ++pSrc)
    {
        if (SAFE[*pSrc]) 
            *pEnd++ = *pSrc;
        else
        {
            // escape this char
            *pEnd++ = '%';
            *pEnd++ = DEC2HEX[*pSrc >> 4];
            *pEnd++ = DEC2HEX[*pSrc & 0x0F];
        }
    }

    std::string sResult((char *)pStart, (char *)pEnd);
    delete [] pStart;
    return sResult;
}

std::string url_decode( std::string sSrc )
{
    // Note from RFC1630:  "Sequences which start with a percent sign
    // but are not followed by two hexadecimal characters (0-9, A-F) are reserved
    // for future extension"

    const char HEX2DEC[256] = 
    {
        /*       0  1  2  3   4  5  6  7   8  9  A  B   C  D  E  F */
        /* 0 */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* 1 */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* 2 */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* 3 */  0, 1, 2, 3,  4, 5, 6, 7,  8, 9,-1,-1, -1,-1,-1,-1,

        /* 4 */ -1,10,11,12, 13,14,15,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* 5 */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* 6 */ -1,10,11,12, 13,14,15,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* 7 */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,

        /* 8 */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* 9 */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* A */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* B */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,

        /* C */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* D */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* E */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1,
        /* F */ -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1, -1,-1,-1,-1
    };

    const unsigned char * pSrc = (const unsigned char *)sSrc.c_str();
    const int SRC_LEN = sSrc.length();
    const unsigned char * const SRC_END = pSrc + SRC_LEN;
    const unsigned char * const SRC_LAST_DEC = SRC_END - 2;   // last decodable '%' 

    char * const pStart = new char[SRC_LEN];
    char * pEnd = pStart;

    while (pSrc < SRC_LAST_DEC)
    {
        if (*pSrc == '%')
        {
            char dec1, dec2;
            if (-1 != (dec1 = HEX2DEC[*(pSrc + 1)])
                && -1 != (dec2 = HEX2DEC[*(pSrc + 2)]))
            {
                *pEnd++ = (dec1 << 4) + dec2;
                pSrc += 3;
                continue;
            }
        }

        *pEnd++ = *pSrc++;
    }

    // the last 2- chars
    while (pSrc < SRC_END)
        *pEnd++ = *pSrc++;

    std::string sResult(pStart, pEnd);
    delete [] pStart;
    return sResult;
}

Ofcourse I call url_decode, but I get a string..( so I hope now my problem is more clear.

possible duplicate of [how to convert UTF-8 std::string to UTF-16 std::wstring](http://stackoverflow.com/questions/7153935/how-to-convert-utf-8-stdstring-to-utf-16-stdwstring) — Nicol Bolas, Aug 26 '11 at 22:18
@Kabumbus: Is the high byte or the low byte first in your string? If you have `12 34` in your string would you expect to get `1234` or `3412` in your wstring? — john, Aug 26 '11 at 22:25
@Kalumbus: I am now completely confused. I've changed my mind again and think that you really want to convert UTF-8 to UTF-16, rather than just doing some byte shifting. But who knows. If you do need UTF-8 to UTF-16 the link that Nicol Bolas posted above will work. — john, Aug 26 '11 at 22:45
@Kalumbus, can you post an example of what you want using byte and word values instead of using foreign characters which are just very confusing. — john, Aug 26 '11 at 22:57
Could you post the binary dump of your file (or at least a small part of it)? You can create such a dump easily with `xxd`. We could more easily understand how many times incorrect conversion were performed. — Sylvain Defresne, Aug 26 '11 at 23:18

Sqeaky · Answer 1 · 2011-08-27T04:46:24.297

3

Here is what I am tinkering around with for a solution to your issue:

std::string wrong("РќРѕРІР°СЏ РїР°РїРєР°");
std::wstring correct( (wchar_t*)wrong.data() );

According to http://www.cplusplus.com/reference/string/string/data/ the data() member function should give us the raw char* and simply casting to a (wchar_t*) should cause it to stick the 00 and 00 together to make 0000, as you describe in you example.

I personally don't like casting like this, but this is all I have come up with so far.

Edit - Which library are you using? Does it come with some other function to reverse what it has done?

If it is popular surely someone else has had this issue before. How did They solve it?

Edit 2 - Here is a disgusting way, using malloc, some assumptions that there won't be any half code-points in the original string, and another terrible cast. :(

std::string wrong("РќРѕРІР°СЏ РїР°РїРєР°");
wchar_t *lesswrong = (wchar_t*) malloc (wrong.size()/sizeof(wchar_t) + sizeof(wchar_t));
lesswrong = (wchar_t*)wrong.data();
lesswrong[wrong.size()] = '\0';
std::wstring correct( lesswrong );

There is no way this can be correct. Even if it works it is so ugly.

Edit 3 - Like Kerrick sadi, this is a better way to do it.

std::string wrong("РќРѕРІР°СЏ РїР°РїРєР°");
std::wstring correct( (wchar_t*)wrong.data(), wrong.size()/2 );

edited Aug 27 '11 at 04:46

answered Aug 26 '11 at 22:55

Sqeaky

1,876
3
21
40

1

You might also have to pass a size argument here, because `data()` doesn't give a (wchar-)null-terminated string. – Kerrek SB Aug 26 '11 at 23:03
1

This won't work if `sizeof(wchar_t)` is not equal to 2. On Mac OS X, `sizeof(wchar_t)` is 4 for 64-bit programs. – Sylvain Defresne Aug 26 '11 at 23:06
1

Will only work if the byte ordering correct, plus you have the lack of a null terminator that Kerrek mentions. – john Aug 26 '11 at 23:06
You are all right, I totally ignored the lack of null terminator :( I have another, but still disgusting idea – Sqeaky Aug 26 '11 at 23:27
1

This assumes that the data was encoded on this machine, so byte order should match, but you are right, on a unix machine this would stuff 2 utf16 codepoints into one utf32 codepoint. I made another assumption in assuming that the library he is using would use the same size for wchar_t that this code would use (seems like a stretch I admit). – Sqeaky Aug 26 '11 at 23:34
1

@Sqeaky : `std::wstring correct( (wchar_t*)wrong.data(), wrong.size() );` is definitely not correct. The constructor wants the number of characters, but `wrong.size()` is giving the number of bytes. – ildjarn Aug 27 '11 at 00:04

Sylvain Defresne · Answer 2 · 2011-08-26T23:31:59.437

1

If I understand you correctly, you have a std::string object that contains an UTF-16 encoded string, and you want to convert it to a std::wstring without changing the encoding. If I'm correct, then, you don't have to do conversion of encoding, nor of the representation but only of the storage.

You also think that the string may have incorrectly be encoded into UTF-8. However, UTF-8 is a variable length encoding, but the length of your incorrectly interpreted data (РќРѕРІР°СЏ РїР°РїРєР° is 22 characters long) is exactly twice the length of your original data (Новая папка is 11 characters long). This is why I suspect that this may be just a case of wrong storage and not wrong encoding.

The following code does that:

std::wstring convert_utf16_string_to_wstring(const std::string& input) {
    assert((input.size() & 1) == 0);
    size_t len = input.size() / 2;
    std::wstring output;
    output.resize(len);

    for (size_t i = 0; i < len; ++i) {
        unsigned char chr1 = (unsigned char)input[2 * i];
        unsigned char chr2 = (unsigned char)input[2 * i + 1];

        // Note: this line suppose that you use `UTF-16-BE` both for
        // the std::string and the std::wstring. You'll have to swap
        // chr1 & chr2 if this is not the case.
        unsigned short val = (chr2 << 8)|(chr1);
        output[i] = (wchar_t)(val);
    }

    return output;
}

If you know that on all the platform you target sizeof(wchar_t) equal 2 (this is not the case one Mac OS for 64-bit programs for exemple where sizeof(wchar_t) equals 4), then you can use a simple cast:

std::wstring convert_utf16_string_to_wstring(const std::string& input) {
    assert(sizeof(wchar_t) == 2); // A static assert would be better here
    assert((input.size() & 1) == 0);
    return input.empty()
        ? std::wstring()
        : std::wstring((wchar_t*)input[0], input.size() / 2);
}

edited Aug 26 '11 at 23:31

answered Aug 26 '11 at 22:38

Sylvain Defresne

42,429
12
75
85

This does not convert from UTF-8 to UTF-16, it merely converts from a byte representation of UTF-16 to a word representation. – bdonlan Aug 26 '11 at 22:40
well I get `⿯뾍껯뾢ꃯ뾿⃯뾯ꃯ뾯ꫯ뾠` which is much more like real letters but something is wrong..( – Rella Aug 26 '11 at 22:42
@bdonlan: It's not at all clear to me that the OP does want a UTF-8 to UTF-16 conversion. I've changed my mind twice on this. – john Aug 26 '11 at 22:49
1

@bdonlan: This OP does not want conversion from UTF-16 to UTF-8. He has a `std::string` that contains data encoded in `UTF-16` and he want that data stored in a `std::wstring`. – Sylvain Defresne Aug 26 '11 at 22:52
@Kabumbus: Sorry, I inverted `chr1` and `chr2` in one of the line. I've edited my post with what I expect is the correct version. BTW, if `sizeof(wchar_t) == 2` then you can probably just do a `memcpy(&output[0], &input[0], input.size())` instead of the loop. – Sylvain Defresne Aug 26 '11 at 23:01
The OP says, "So we get a string like РќРѕРІР°СЏ РїР°РїРєР° which is utf-8 representation of [...]" – bdonlan Aug 26 '11 at 23:15
@bdonlan: well, if I take the original string _Новая папка_, encoded in utf-16, interpret it as a iso-8859-5 (it seems to be russian) encoding and encode it in utf-8, I get something completely different from _РќРѕРІР°СЏ РїР°РїРєР°_. Moreover, the string _РќРѕРІР°СЏ РїР°РїРєР°_ has 22 characters while the original string _Новая папка_ has only 11, wich make me think there was no conversion to utf-8 (which is a variable length encoding). – Sylvain Defresne Aug 26 '11 at 23:26

How to turn std::string that contains utf-16 encoded text in it into utf-16 wstring?

2 Answers2