2

I have a Windows application where string types are WCHAR*. I need to convert these to char* to pass them to a C API. I am using the MultiByteToWideChar and WideCharToMultiByte functions to perform the conversion.

But for some reason the conversion is not working properly: I am seeing a lot of gibberish in the output. The following code is a modified version of the code found in this Stack Overflow answer.

WCHAR* convert_to_wstring(const char* str)
{
    int size_needed = MultiByteToWideChar(CP_UTF8, 0, str, (int)strlen(str), NULL, 0);
    WCHAR* wstrTo = (WCHAR*)malloc(size_needed);
    MultiByteToWideChar(CP_UTF8, 0, &str[0], (int)strlen(str), wstrTo, size_needed);
    return wstrTo;
}

char* convert_from_wstring(const WCHAR* wstr)
{
    int size_needed = WideCharToMultiByte(CP_UTF8, 0, wstr, (int)wcslen(wstr), NULL, 0, NULL, NULL);
    char* strTo = (char*)malloc(size_needed);
    WideCharToMultiByte(CP_UTF8, 0, wstr, (int)wcslen(wstr), strTo, size_needed, NULL, NULL);
    return strTo;
}

int main()
{
    const WCHAR* wText = L"Wide string";
    const char* text = convert_from_wstring(wText);
    std::cout << text << "\n";
    std::cout << convert_to_wstring("Multibyte string") << "\n";
    return 0;
}
Navaneeth K N
  • 1
    Why did you feel the need to rewrite the other code instead of just using it as-is? Clearly you have not read the documentation for the functions being used. And you do realize that you can't output a `WCHAR*` using `std::cout`, don't you? You would need to use `std::wcout` instead. – Remy Lebeau Mar 14 '17 at 18:28
  • What do you expect to see when using `cout` to output `WCHAR*`? – Paul Mar 14 '17 at 18:30
  • I could. The only reason is that that code uses std::string and std::wstring, which is another level of indirection for me, because all the other parts of the code use char* and WCHAR*. – Navaneeth K N Mar 14 '17 at 18:30
  • 1
    @Appu: well, then you are not really taking advantage of C++, are you? You may as well just code in C instead. – Remy Lebeau Mar 14 '17 at 18:31
  • @RemyLebeau Yes. I know it. It is a legacy code base and changes are not that easy to make. – Navaneeth K N Mar 14 '17 at 18:35
  • 2
    So what if the system is largely based on `char*`/`wchar_t*`? That is what the `std::string::c_str()` and `std::wstring::c_str()` are meant for. Let the STL handle the memory management for you (especially since you are getting it all wrong anyway). – Remy Lebeau Mar 14 '17 at 18:41

2 Answers

13

Your conversion functions are buggy.

The return value of MultiByteToWideChar() is a number of wide characters, not a number of bytes as you are currently treating it. You need to multiply the value by sizeof(WCHAR) when calling malloc().

You are also not taking into account that the return value DOES NOT include space for a null terminator, because you are not passing -1 in the cbMultiByte parameter. Read the MultiByteToWideChar() documentation:

cbMultiByte [in]
Size, in bytes, of the string indicated by the lpMultiByteStr parameter. Alternatively, this parameter can be set to -1 if the string is null-terminated. Note that, if cbMultiByte is 0, the function fails.

If this parameter is -1, the function processes the entire input string, including the terminating null character. Therefore, the resulting Unicode string has a terminating null character, and the length returned by the function includes this character.

If this parameter is set to a positive integer, the function processes exactly the specified number of bytes. If the provided size does not include a terminating null character, the resulting Unicode string is not null-terminated, and the returned length does not include this character.

...

Return value

Returns the number of characters written to the buffer indicated by lpWideCharStr if successful. If the function succeeds and cchWideChar is 0, the return value is the required size, in characters, for the buffer indicated by lpWideCharStr.

You are not null-terminating your output string.

The same goes with your convert_from_wstring() function. Read the WideCharToMultiByte() documentation:

cchWideChar [in]
Size, in characters, of the string indicated by lpWideCharStr. Alternatively, this parameter can be set to -1 if the string is null-terminated. If cchWideChar is set to 0, the function fails.

If this parameter is -1, the function processes the entire input string, including the terminating null character. Therefore, the resulting character string has a terminating null character, and the length returned by the function includes this character.

If this parameter is set to a positive integer, the function processes exactly the specified number of characters. If the provided size does not include a terminating null character, the resulting character string is not null-terminated, and the returned length does not include this character.

...

Return value

Returns the number of bytes written to the buffer pointed to by lpMultiByteStr if successful. If the function succeeds and cbMultiByte is 0, the return value is the required size, in bytes, for the buffer indicated by lpMultiByteStr.
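
Putting the two points together, the allocation size is the returned count, plus one element for the terminator, multiplied by the element size. The sketch below is portable C++ rather than Windows code: wchar_t stands in for WCHAR, and the helper names wide_buffer_bytes and narrow_buffer_bytes are made up for illustration.

```cpp
#include <cstddef>

// Bytes to allocate for a wide string of `num_chars` characters (the count
// returned by the sizing call when an explicit length was passed), plus one
// extra element for the null terminator that the count does not include.
// wchar_t stands in for WCHAR here; the helper names are hypothetical.
inline std::size_t wide_buffer_bytes(int num_chars)
{
    return (static_cast<std::size_t>(num_chars) + 1) * sizeof(wchar_t);
}

// Same rule for the narrow direction, where each element is one byte.
inline std::size_t narrow_buffer_bytes(int num_bytes)
{
    return (static_cast<std::size_t>(num_bytes) + 1) * sizeof(char);
}
```

For "Wide string" (11 characters), wide_buffer_bytes(11) is 12 * sizeof(wchar_t), which is 24 bytes on Windows, whereas the code in the question allocates only 11 bytes for the wide buffer.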

Your main() code is also leaking the allocated strings. Since they are allocated with malloc(), you need to deallocate them with free() when you are done using them.

Also, you cannot meaningfully pass a WCHAR* string to std::cout. It has no operator<< for wide strings, but it does have an operator<< for void* input, so it ends up printing the memory address that the WCHAR* points at, not the actual characters. Since C++20 even that fallback is gone: the overloads for wide character pointers on narrow streams are deleted, so the code does not compile. If you want to output wide strings, use std::wcout instead.
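
The difference is easy to see with string streams in place of the console. This is only a sketch: the function names narrow_output and wide_output are invented for the example, and the explicit cast to const void* selects the same void* overload that the narrow stream falls back to.

```cpp
#include <sstream>
#include <string>

// What a narrow stream can do with a wide pointer: print its address.
// The explicit cast selects the operator<< overload that takes a void*.
inline std::string narrow_output(const wchar_t* wstr)
{
    std::ostringstream out;
    out << static_cast<const void*>(wstr);
    return out.str();
}

// A wide stream has an operator<< for const wchar_t* and prints the text.
inline std::wstring wide_output(const wchar_t* wstr)
{
    std::wostringstream out;
    out << wstr;
    return out.str();
}
```

wide_output(L"Wide string") returns the text itself, while narrow_output(L"Wide string") returns an implementation-defined rendering of the pointer value.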

Try something more like this:

WCHAR* convert_to_wstring(const char* str)
{
    int str_len = (int) strlen(str);
    int num_chars = MultiByteToWideChar(CP_UTF8, 0, str, str_len, NULL, 0);
    WCHAR* wstrTo = (WCHAR*) malloc((num_chars + 1) * sizeof(WCHAR));
    if (wstrTo)
    {
        MultiByteToWideChar(CP_UTF8, 0, str, str_len, wstrTo, num_chars);
        wstrTo[num_chars] = L'\0';
    }
    return wstrTo;
}

CHAR* convert_from_wstring(const WCHAR* wstr)
{
    int wstr_len = (int) wcslen(wstr);
    int num_chars = WideCharToMultiByte(CP_UTF8, 0, wstr, wstr_len, NULL, 0, NULL, NULL);
    CHAR* strTo = (CHAR*) malloc((num_chars + 1) * sizeof(CHAR));
    if (strTo)
    {
        WideCharToMultiByte(CP_UTF8, 0, wstr, wstr_len, strTo, num_chars, NULL, NULL);
        strTo[num_chars] = '\0';
    }
    return strTo;
}

int main()
{
    const WCHAR* wText = L"Wide string";
    char* text = convert_from_wstring(wText);
    if (text)
    {
        std::cout << text << "\n";
        free(text); // free() takes a non-const pointer
    }

    WCHAR* wtext = convert_to_wstring("Multibyte string");
    if (wtext)
    {
        std::wcout << wtext << "\n";
        free(wtext);
    }

    return 0;
}
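
If the raw malloc()-based functions have to stay, one way to avoid forgetting free() is to hand the returned pointer to a std::unique_ptr with a deleter that calls free(). This is a sketch, not part of the code above: FreeDeleter, MallocString and make_malloc_copy are hypothetical names, and make_malloc_copy only stands in for a function such as convert_from_wstring() that returns a malloc()-allocated string.

```cpp
#include <cstdlib>
#include <cstring>
#include <memory>

// Deleter that releases malloc()-allocated memory with free().
struct FreeDeleter
{
    void operator()(void* p) const noexcept { std::free(p); }
};

using MallocString = std::unique_ptr<char, FreeDeleter>;

// Hypothetical stand-in for convert_from_wstring(): returns a malloc()'d,
// null-terminated copy of `src`, or an empty pointer if allocation fails.
inline MallocString make_malloc_copy(const char* src)
{
    const std::size_t len = std::strlen(src);
    MallocString copy(static_cast<char*>(std::malloc(len + 1)));
    if (copy)
        std::memcpy(copy.get(), src, len + 1); // includes the terminator
    return copy;
}
```

With the functions above, MallocString text(convert_from_wstring(wText)); would take ownership of the buffer, text.get() gives the char* for the C API, and free() runs automatically when text goes out of scope.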

That being said, you really should be using std::string and std::wstring instead of char* and wchar_t* for better memory management:

std::wstring convert_to_wstring(const std::string &str)
{
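    // The API takes int lengths, so convert explicitly instead of narrowing silently.
    const int str_len = static_cast<int>(str.length());
    int num_chars = MultiByteToWideChar(CP_UTF8, 0, str.c_str(), str_len, NULL, 0);
    std::wstring wstrTo;
    if (num_chars > 0)
    {
        wstrTo.resize(num_chars);
        MultiByteToWideChar(CP_UTF8, 0, str.c_str(), str_len, &wstrTo[0], num_chars);
    }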
    return wstrTo;
}

std::string convert_from_wstring(const std::wstring &wstr)
{
    // The API takes int lengths, so convert explicitly instead of narrowing silently.
    const int wstr_len = static_cast<int>(wstr.length());
    int num_chars = WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), wstr_len, NULL, 0, NULL, NULL);
    std::string strTo;
    if (num_chars > 0)
    {
        strTo.resize(num_chars);
        WideCharToMultiByte(CP_UTF8, 0, wstr.c_str(), wstr_len, &strTo[0], num_chars, NULL, NULL);
    }
    return strTo;
}

int main()
{
    const WCHAR* wText = L"Wide string";
    const std::string text = convert_from_wstring(wText);
    std::cout << text << "\n";

    const std::wstring wtext = convert_to_wstring("Multibyte string");
    std::wcout << wtext << "\n";

    return 0;
}

If you are using C++11 or later, have a look at the std::wstring_convert class for converting between UTF strings (note that it is deprecated as of C++17), e.g.:

std::wstring convert_to_wstring(const std::string &str)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> conv;
    return conv.from_bytes(str);
}

std::string convert_from_wstring(const std::wstring &wstr)
{
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>, wchar_t> conv;
    return conv.to_bytes(wstr);
}

If you need to interact with other code that is based on char*/wchar_t*, std::string has a constructor for accepting char* input and a c_str() method that can be used for char* output, and the same goes for std::wstring and wchar_t*.
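
A minimal sketch of that interop, using portable C++ only. The legacy_* functions are hypothetical placeholders for existing pointer-based code, and from_legacy/to_legacy are names invented for this example.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>
#include <string>

// Hypothetical legacy functions that only understand raw pointers.
inline std::size_t legacy_length(const char* s) { return std::strlen(s); }
inline std::size_t legacy_wide_length(const wchar_t* s) { return std::wcslen(s); }

// Raw pointer in: std::string copies the characters and owns the copy.
inline std::string from_legacy(const char* s) { return std::string(s); }

// Raw pointer out: c_str() yields a null-terminated pointer that stays valid
// only while the string is alive and unmodified.
inline std::size_t to_legacy(const std::string& s) { return legacy_length(s.c_str()); }
inline std::size_t to_legacy(const std::wstring& s) { return legacy_wide_length(s.c_str()); }
```

The string objects keep ownership the whole time, so nothing has to be freed by hand. The one thing to watch is lifetime: do not store the pointer returned by c_str() past the life of the string it came from.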

Remy Lebeau
  • 2
    Has anyone done a performance comparison of a recent implementation of `wstring_convert` vs. `MultiByteToWideChar()` and `WideCharToMultiByte()`? – zett42 Mar 14 '17 at 19:05
  • The C++11 examples saved me from hours of frustration... kudos for that, Remy. However, I found another "version" with some error handling; maybe you want to add it to your answer? I put it on pastebin: https://pastebin.com/hq2grb84 – Mecanik Aug 29 '18 at 16:16
0

The easiest way to do this on Windows is to use _bstr_t from <comdef.h>:

#include <comdef.h> // _bstr_t

#include <string>
#include <iostream>

std::wstring convert_to_wstring(const char* str)
{
    return _bstr_t(str);
}

std::string convert_from_wstring(const WCHAR* wstr)
{
    return _bstr_t(wstr);
}

int main()
{
    const auto text = convert_from_wstring(L"Wide string");
    const auto wide_text = convert_to_wstring("Multibyte string");
    return 0;
}

Also, notice how much easier it is to return std::wstring and std::string.

Ðаn
  • The documentation on `_bstr_t::_bstr_t(const char*)` is ambiguous: *This constructor first performs a multibyte to Unicode conversion.*. Which encoding does it use to convert from multibyte to Unicode? – Paul Mar 14 '17 at 18:33
  • I assume it will use OEM encoding and not UTF8. And OP is converting to/from UTF8 string. – Paul Mar 14 '17 at 18:38
  • @Paul I doubt it uses the OEM codepage, it probably uses the "ANSI" codepage (CP_ACP). – Anders Mar 14 '17 at 22:39
  • @Anders Doesn't matter in this context, it still doesn't solve OP's task (convert UTF8 string to/from UTF16 string). – Paul Mar 15 '17 at 02:39
  • 1
    `_bstr_t` is not a general Windows solution, it is a Visual Studio specific solution. Other Windows compilers may not have it. And worse, since it wraps a BSTR, it uses the COM memory manager instead of the compiler's own RTL memory manager. And then it has to be copied to `std::wstring`, so you may be using less code but you are doing a double allocation. – Remy Lebeau Mar 15 '17 at 14:38