UTF8 data to std::string or std::wstring

Question

I receive the body bytes from an HTTP server response and I don't know how to convert them to a UTF-8 string so I can work with them.

I have an idea, but I am not sure whether it works. I need to search and modify the bytes of the response, so I need to convert the std::vector<BYTE> to a std::wstring or std::string.

The UTF-8 encoded bytes of the response are in my std::vector<BYTE>. How can I convert them to a std::string? Or should I convert them to a std::wstring?

I found this code:

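// Converts a string in the system ANSI code page (CP_ACP) to UTF-8, going through UTF-16.
std::string Encoding::StringToUtf8(const std::string& str)
{
    // First call: ask how many UTF-16 code units the result needs.
    INT size = MultiByteToWideChar(CP_ACP, MB_COMPOSITE, str.c_str(), static_cast<int>(str.length()), NULL, 0);

    std::wstring utf16_str(size, L'\0');

    // Second call: perform the ANSI -> UTF-16 conversion.
    MultiByteToWideChar(CP_ACP, MB_COMPOSITE, str.c_str(), static_cast<int>(str.length()), &utf16_str[0], size);

    // Same two-call pattern for UTF-16 -> UTF-8.
    INT utf8_size = WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(), static_cast<int>(utf16_str.length()), NULL, 0, NULL, NULL);

    std::string utf8_str(utf8_size, '\0');

    WideCharToMultiByte(CP_UTF8, 0, utf16_str.c_str(), static_cast<int>(utf16_str.length()), &utf8_str[0], utf8_size, NULL, NULL);

    return utf8_str;
}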

But if I now want to search for a character like "Ñ" in the string, will that work? Or do I have to convert the bytes to a std::wstring, search for "Ñ", modify the std::wstring, and convert it back to a std::string?

Which of the two would be correct?

I need to put the UTF-8 response into a std::string or std::wstring so that I can search and modify the data (which contains special characters) and then resend the response to the client as UTF-8.

[std::codecvt](http://en.cppreference.com/w/cpp/locale/codecvt) might help. — felix, Mar 28 '17 at 11:41
BTW, `std::codecvt` is very slow. If performance is critical, OS-specific functions may be faster; on Windows, `MultiByteToWideChar` and friends are much faster — kreuzerkrieg, Mar 28 '17 at 11:50
I can quickly put it together, but... just a week ago I did the profiling. I was so eager to get rid of the OS-specific code, but alas, it was a no-go; as always, `locale`s are screwing everything up. But if you insist... — kreuzerkrieg, Mar 28 '17 at 12:02
@felix AFAICT std::codecvt cannot portably convert between the native charset/encoding and Unicode. — n. m. could be an AI, Mar 28 '17 at 13:03

kreuzerkrieg · Accepted Answer · 2017-03-28T12:25:31.147

Storing UTF-8 in a std::string is no different from storing a sequence of bytes in a vector. std::string is not aware of any encoding whatsoever, and member functions like find or <algorithm> functions like std::find operate on individual char values (bytes), so they will not treat a multi-byte character as a single unit once you go beyond standard ASCII.

So it is up to you how you handle this situation: you can convert your search term (L"Ñ") to a UTF-8 byte sequence and look for that sequence in the std::string, or you can convert your string to a std::wstring and work directly on that. IMHO, in your case, where you have to manipulate the input (search, extract words, split by letters or replace, all of it beyond the ASCII range), you are better off sticking with wstring and converting to a UTF-8 std::string just before sending it to the client.
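As a rough illustration of the first option (keeping the data as UTF-8 bytes and searching for the encoded form of the character), here is a minimal, portable sketch. The helper names bytes_to_string and replace_all are made up for this example, and std::vector<unsigned char> stands in for std::vector<BYTE>, on the assumption that BYTE is unsigned char as it is in the Windows headers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Copy the raw response bytes into a std::string. No conversion takes place:
// UTF-8 bytes in, the same UTF-8 bytes out.
std::string bytes_to_string(const std::vector<unsigned char>& bytes) {
    return std::string(bytes.begin(), bytes.end());
}

// Replace every occurrence of the byte sequence `from` with `to` and return
// the number of replacements. The match is exact, byte for byte.
std::size_t replace_all(std::string& text, const std::string& from, const std::string& to) {
    assert(!from.empty());  // an empty needle would match at every position
    std::size_t count = 0;
    std::size_t pos = text.find(from);
    while (pos != std::string::npos) {
        text.replace(pos, from.size(), to);
        ++count;
        pos = text.find(from, pos + to.size());  // continue after the inserted text
    }
    return count;
}
```

With this, `replace_all(body, "\xC3\x91", "N")` replaces the precomposed "Ñ" (U+00D1, encoded in UTF-8 as the bytes C3 91). It only matches that exact byte sequence, so it says nothing about other spellings of the same character.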
EDIT001: Regarding std::codecvt_utf8, mentioned in a comment above, and my comment about performance concerns, here is the test:

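#include <windows.h>

#include <chrono>
#include <codecvt>
#include <iostream>
#include <locale>
#include <string>

// Standard-library conversion: UTF-8 bytes to a wide string.
std::wstring foo(const std::string& input)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.from_bytes(input.c_str());
}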

std::wstring baz(const std::string& input)
{
    std::wstring retVal;
    auto targetSize = MultiByteToWideChar(CP_UTF8, 0, input.c_str(), static_cast<int>(input.size()), NULL, 0);
    retVal.resize(targetSize);
    auto res = MultiByteToWideChar(CP_UTF8, 0, input.c_str(), static_cast<int>(input.size()),
                                   &retVal[0], targetSize);
    if(res == 0)
    {
        // handle error, throw, do something...
    }
    return retVal;
}

int main()
{
    std::string input = "Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut "
                        "labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco "
                        "laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in "
                        "voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat "
                        "cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.";

    {
        auto start = std::chrono::high_resolution_clock::now();
        for(int i = 0; i < 100'000; ++i)
        {
            auto result = foo(input);
        }
        auto end = std::chrono::high_resolution_clock::now();
        auto res = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::cout << "Elapsed time: " << res << std::endl;
    }

    {
        auto start = std::chrono::high_resolution_clock::now();
        for(int i = 0; i < 100'000; ++i)
        {
            auto result = baz(input);
        }
        auto end = std::chrono::high_resolution_clock::now();
        auto res = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::cout << "Elapsed time: " << res << std::endl;
    }
    return 0;
}

Results when compiled and run as Release x64 (times in milliseconds, foo first, then baz):
Elapsed time: 3065
Elapsed time: 29

Two orders of magnitude...

on windows, wchar_t represents a utf16 code unit, so it's equally complicated as storing utf8 in a std::string. — The Techel, Mar 28 '17 at 11:51
It depends on what you are going to store... If you store a lot of Han characters, `wstring` will consume less memory than `string`; if you use extended ASCII, `string` will consume half the memory of `wstring`. When I mention `string` I mean a `std::string` filled with utf-8 sequences — kreuzerkrieg, Mar 28 '17 at 11:54
TBH split by letter is a pain in UTF-8, -16 or -32. The problem is that you have combining characters so you can't say one codepoint=one letter. Unicode was a nice clean idea, but it started to derail well before it got to emoji's. — MSalters, Mar 28 '17 at 12:14
Typically, one cannot rely on a HTTP server to respond with one's system native character encoding, so the standard codecvt functionality will probably be of little use. Also `std::codecvt_utf8` will be deprecated in C++17 in favour of using `std::codecvt` directly. — eerorika, Mar 28 '17 at 12:52
"The std::string is not aware of any encoding stuff whatsoever" but `char` is, as `ctype` is locale dependent. Also if you try to output UTF8 to a text stream that cannot handle it, it's gonna fail. — n. m. could be an AI, Mar 28 '17 at 13:08
I think it is a question about handling strings, outputting and visualizing glyphs on the screen is completely another story — kreuzerkrieg, Mar 28 '17 at 13:14

eerorika · Answer 2 · 2017-03-28T13:26:49.130

> I receive the body bytes from an HTTP server response and I don't know how to convert them to a UTF-8 string to work with them.

You'll need to follow these steps:

1. Figure out the character encoding that the HTTP server responds with. The server should send the information in a header.
2. Get yourself a copy of the standard that specifies the encoding used by the server.
3. Get yourself a copy of the Unicode standard.
4. Loop over each grapheme cluster and convert according to each spec.

The fourth step is obviously the least trivial one. The exact implementation depends on which encoding you are converting from. And it will be too broad for my answer.
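For the first step, and assuming your HTTP library hands you the raw Content-Type header value as a string, the declared encoding could be pulled out roughly like this. The function name charset_from_content_type is made up for illustration, and the parsing is deliberately simplistic: it looks for the text "charset=" anywhere in the value and does not implement the full header grammar.

```cpp
#include <algorithm>
#include <cctype>
#include <cstddef>
#include <string>

// Extract the charset parameter from a Content-Type header value, e.g.
// "text/html; charset=UTF-8" yields "utf-8". Returns an empty string when
// no charset parameter is present.
std::string charset_from_content_type(std::string value) {
    // Parameter names and charset names are case-insensitive: lower-case first.
    std::transform(value.begin(), value.end(), value.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    const std::string key = "charset=";
    std::size_t pos = value.find(key);
    if (pos == std::string::npos) return "";
    pos += key.size();

    // The value may be quoted, as in charset="utf-8".
    if (pos < value.size() && value[pos] == '"') {
        const std::size_t close = value.find('"', pos + 1);
        if (close == std::string::npos) return value.substr(pos + 1);
        return value.substr(pos + 1, close - pos - 1);
    }

    // Unquoted: the value ends at the next separator or at the end of the string.
    const std::size_t end = value.find_first_of("; \t", pos);
    if (end == std::string::npos) return value.substr(pos);
    return value.substr(pos, end - pos);
}
```

If this returns an empty string, the header did not declare a charset, and the encoding has to be determined some other way.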

It is usually cost effective to use an existing implementation, so that you don't have to do steps 2-4 yourself. The standard library has very limited conversion options (only between different Unicode formats, and between native narrow and native wide), so you probably cannot rely on it.
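To give an idea of what the conversion loop involves when the server does report UTF-8, here is a minimal sketch of a hand-written decoder that turns UTF-8 bytes into Unicode code points. The name decode_utf8 is invented for this example. It produces code points, not grapheme clusters, so a character written with combining marks still comes out as several char32_t values. A tested library is probably the better choice for production use.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decode a UTF-8 byte string into Unicode code points (one char32_t each).
// Throws std::invalid_argument on malformed input: invalid lead bytes,
// truncated sequences, bad continuation bytes, overlong forms, surrogates,
// or values above U+10FFFF.
std::u32string decode_utf8(const std::string& bytes) {
    std::u32string out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        const unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        std::size_t extra = 0;   // continuation bytes expected after the lead byte
        char32_t lowest = 0;     // smallest code point allowed for this length
        if (lead < 0x80)                { cp = lead;        extra = 0; lowest = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; lowest = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; lowest = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; extra = 3; lowest = 0x10000; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (bytes.size() - i <= extra)
            throw std::invalid_argument("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(bytes[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append the 6 payload bits
        }

        if (cp < lowest || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("overlong or out-of-range UTF-8 sequence");

        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

For example, the bytes C3 91 decode to the single code point U+00D1 ("Ñ"), while the bytes 4E CC 83 decode to two code points, U+004E followed by U+0303.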

> so I need to transform the std::vector to std::wstring

It makes very little sense to store UTF-8 encoded characters in a wide character string, since UTF-8 is a narrow character encoding.

> But now if I want to search a character like "Ñ" in the string will work?

Sure, although keep in mind that the string algorithms of the C++ standard library don't consider encoding, so they might not be an option for implementing the search, especially if you wish to search for an arbitrary grapheme cluster that consists of multiple code points. To properly search for any UTF-8 character within a UTF-8 string, you need to:

1. Decide on the semantics of the comparison of the search. Should Ñ match N? How about canonical equivalence (normalized vs non-normalized version of the same character)?
2. If you wish to perform a trivial, exact byte for byte search, then the standard C++ functionality will be sufficient. Otherwise, go to 3.
3. Get yourself a copy of the Unicode standard.
4. Loop over each grapheme cluster and compare it to the argument grapheme cluster.

The fourth step is obviously the least trivial one. The exact implementation depends on what kind of semantics you need for the search. And it will be too broad for my answer.

It is usually cost effective to use an existing implementation, so that you don't have to do steps 3-4 yourself.
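To make the canonical-equivalence point concrete: "Ñ" can arrive either precomposed (U+00D1) or as "N" followed by a combining tilde (U+0303), and a byte-for-byte search for one spelling will miss the other. The sketch below only handles that by listing the spellings explicitly; find_any and kCapitalEnye are names invented for this example. It is a stand-in for real normalization, which would normally come from an existing Unicode library.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Return the byte offset of the first occurrence of any of the given
// spellings in `text`, or std::string::npos if none of them occurs.
std::size_t find_any(const std::string& text, const std::vector<std::string>& spellings) {
    std::size_t best = std::string::npos;
    for (const std::string& s : spellings) {
        assert(!s.empty());  // an empty spelling would match at offset 0
        const std::size_t pos = text.find(s);
        if (pos < best) best = pos;  // npos is the largest value, so it never wins
    }
    return best;
}

// The two canonically equivalent UTF-8 spellings of "Ñ":
//   U+00D1          -> C3 91      (precomposed)
//   U+004E U+0303   -> 4E CC 83   (N + combining tilde)
const std::vector<std::string> kCapitalEnye = {"\xC3\x91", "N\xCC\x83"};
```

A plain `text.find("\xC3\x91")` finds only the precomposed form, whereas `find_any(text, kCapitalEnye)` finds both. Hand-listing spellings does not scale beyond a few known characters, which is one more reason to use an existing implementation.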

Which standard facility is responsible for conversion between native encoding and unicode? — n. m. could be an AI, Mar 28 '17 at 13:15
"Get yourself a copy of the unicode standard", seriously? Did you see the size of this book? I can't even lift it! :) — kreuzerkrieg, Mar 28 '17 at 13:17
@kreuzerkrieg it'll be easier to lift the pdf version in a memory stick :) — eerorika, Mar 28 '17 at 13:27
