65

I've tried implementing a function like this, but unfortunately it doesn't work:

const wchar_t *GetWC(const char *c)
{
    const size_t cSize = strlen(c)+1;
    wchar_t wc[cSize];
    mbstowcs (wc, c, cSize);

    return wc;
}

My main goal here is to be able to integrate normal char strings in a Unicode application. Any advice you guys can offer is greatly appreciated.

AutoBotAM
  • 1,485
  • 3
  • 18
  • 24
  • 3
    What about the code you've given doesn't work? Please describe the problem you're having. I can guess what it is, but it would help if you would identify the problem. – Greg Hewgill Nov 07 '11 at 02:13
  • 2
    Note that `strlen()` is wrong, you would need to use something like `mbslen()` instead [which is done with `mbstowcs(NULL, ...)`]. UTF-8 does not have the same number of characters. Also wchar_t under MS-Windows uses UTF-16 which adds another fun thing to take in account when measuring the string length. – Alexis Wilke Oct 20 '14 at 01:28

9 Answers9

49

In your example, wc is a local variable which will be deallocated when the function call ends. This puts you into undefined behavior territory.

The simple fix is this:

const wchar_t *GetWC(const char *c)
{
    const size_t cSize = strlen(c)+1;
    wchar_t* wc = new wchar_t[cSize];
    mbstowcs (wc, c, cSize);

    return wc;
}

Note that the calling code will then have to deallocate this memory, otherwise you will have a memory leak.

Andrew Shepherd
  • 44,254
  • 30
  • 139
  • 205
  • 5
    it's not nice to teach novices bad practices like using raw `new`. you should at the very least mention what that entails. and alternatives. – Cheers and hth. - Alf Nov 07 '11 at 02:36
  • 1
    @Alexis Wilke - Care to explain more? – Andrew Shepherd Oct 20 '14 at 08:00
  • 2
    `strlen()` on an mbstring does not return the size of the wstring. You would need to do `cSize = mbstowcs(NULL, c, 0) + 1;` to get the correct size. – Alexis Wilke Oct 20 '14 at 10:33
  • 1
    @Alexis Wilke - Why do you think that `c` is an mbstring? – Andrew Shepherd Oct 20 '14 at 11:46
  • 1
    Why do you use `mbstowcs()` otherwise?! If the locale changes on your, multiple bytes may represent a single UTF-16 character and other sequences may represent two entries in UTF-16 (some Chinese and such is encoded using 4 bytes per character.) – Alexis Wilke Oct 20 '14 at 22:43
  • If you use `std::unique_ptr wa(new wchar_t[cSize])` you won't have to manually delete it later. – a paid nerd Jun 02 '15 at 21:21
  • 1
    Now `mbstowcs_s(&outSize, wc, cSize, c, cSize - 1);` should be used instead of `mbstowcs (wc, c, cSize);` but it works well! – Shlublu Jul 02 '16 at 21:25
  • 1
    So the solution proposed here is to introduce memory leak to the code without any explanation on how to fix it. – masiton Aug 31 '20 at 08:24
  • In cases like this, the best is to give the responsibility to the caller to allocate and provide a pointer to a big enough array of wchar_t. – Poniros Jan 05 '23 at 16:49
47

Use a std::wstring instead of a C99 variable length array. The current standard guarantees a contiguous buffer for std::basic_string. E.g.,

std::wstring wc( cSize, L'#' );
mbstowcs( &wc[0], c, cSize );

C++ does not support C99 variable length arrays, and so if you compiled your code as pure C++, it would not even compile.

With that change your function return type should also be std::wstring.

Remember to set relevant locale in main.

E.g., setlocale( LC_ALL, "" ).

halfer
  • 19,824
  • 17
  • 99
  • 186
Cheers and hth. - Alf
  • 142,714
  • 15
  • 209
  • 331
  • Won't you require C++11 to guarantee that the string buffer is stored contiguously? – Kerrek SB Nov 07 '11 at 02:17
  • As stated in the answer, the current standard guarantees that, yes. The proposal was adopted at the Lillehammer meeting in April 2005. – Cheers and hth. - Alf Nov 07 '11 at 02:25
  • I added your code snippet and the function seems to work now! Although I didn't have to return a `wstring` as I just used the `c_str()` member function and returned that. It seems I also didn't need to call `setlocale` as the defaults seem to suffice for now. – AutoBotAM Nov 07 '11 at 18:14
  • @AutoBotAM: By returning `wstring.c_str()`, you will have the same problem all over again. When a function exits, all local variables are *destroyed*, including those of type `wstring`. The return value of `c_str()` is only valid for the lifetime of its corresponding `wstring` object. Although your code might *look* like it runs correctly, it is accessing memory that has been freed and it will mysteriously fail some day for reasons that are not obvious to you at the time. – Greg Hewgill Nov 07 '11 at 20:49
  • It did look like it worked, but I suppose memory issues like this can be deceiving as you say. I'll probably store the returned string in a global buffer or array of some sort so that the memory is always consistent. I just don't like the use of the new() operator and leaving it up to the user to delete it; that's very welcome to memory leaks, especially if someone else is using my function. – AutoBotAM Nov 09 '11 at 00:00
  • @AutoBotAM better yet, just return the `wstring` by value and let RAII do it's thing – Jan Hohenheim Nov 28 '16 at 08:23
  • 3
    The question clearly states that target type is `wchar_t`, not `wstring`. Why si this offtopic answer marked as a solution? – masiton Aug 31 '20 at 08:25
8
const char* text_char = "example of mbstowcs";
size_t length = strlen(text_char );

Example of usage "mbstowcs"

std::wstring text_wchar(length, L'#');

//#pragma warning (disable : 4996)
// Or add to the preprocessor: _CRT_SECURE_NO_WARNINGS
mbstowcs(&text_wchar[0], text_char , length);

Example of usage "mbstowcs_s"

Microsoft suggest to use "mbstowcs_s" instead of "mbstowcs".

Links:

Mbstowcs example

mbstowcs_s, _mbstowcs_s_l

wchar_t text_wchar[30];

mbstowcs_s(&length, text_wchar, text_char, length);
Maks
  • 2,808
  • 4
  • 30
  • 31
2

You're returning the address of a local variable allocated on the stack. When your function returns, the storage for all local variables (such as wc) is deallocated and is subject to being immediately overwritten by something else.

To fix this, you can pass the size of the buffer to GetWC, but then you've got pretty much the same interface as mbstowcs itself. Or, you could allocate a new buffer inside GetWC and return a pointer to that, leaving it up to the caller to deallocate the buffer.

Greg Hewgill
  • 951,095
  • 183
  • 1,149
  • 1,285
  • The C++ way is to not do raw `new` (in general). E.g. `std::wstring` is the natural result type here. At least when you don't have anything better. Also, with the code regarded as C++ he is not returning anything. The code won't compile as C++. – Cheers and hth. - Alf Nov 07 '11 at 02:22
1

I did something like this. The first 2 zeros are because I don't know what kind of ascii type things this command wants from me. The general feeling I had was to create a temp char array. pass in the wide char array. boom. it works. The +1 ensures that the null terminating character is in the right place.

char tempFilePath[MAX_PATH] = "I want to convert this to wide chars";

int len = strlen(tempFilePath);

// Converts the path to wide characters
    int needed = MultiByteToWideChar(0, 0, tempFilePath, len + 1, strDestPath, len + 1);
Gandalf458
  • 2,139
  • 1
  • 21
  • 36
1

Andrew Shepherd 's answer.

Andrew Shepherd 's answer is Good for me, I add up some fix : 1, remove the ending char L'\0', casue sometime it will trouble. 2, use mbstowcs_s

std::wstring wtos(std::string& value){
    const size_t cSize = value.size() + 1;

    std::wstring wc;
    wc.resize(cSize);

    size_t cSize1;
    mbstowcs_s(&cSize1, (wchar_t*)&wc[0], cSize, value.c_str(), cSize);

    wc.pop_back();

    return wc;
}
1

The question has several problems, but so do some of the answers. The idea of returning a pointer to allocated memory "and leaving it up to the caller to de-allocate" is asking for trouble. As a rule the best pattern is always to allocate and de-allocate within the same function. For example, something like:

wchar_t* buffer = new wchar_t[get_wcb_size(str)];
mbstowcs(buffer, str, get_wcb_size(str) + 1);
...
delete[] buffer;

In general, this requires two functions, one the caller calls to find out how much memory to allocate and a second to initialize or fill the allocated memory. Unfortunately, the basic idea of using a function to return a "new" object is problematic -- not inherently, but because of the C++ inheritance of C memory handling. Using C++ and STL's strings/wstrings/strstreams is a better solution, but I felt the memory allocation thing needed to be better addressed.

Cleverhans
  • 19
  • 1
  • 1
    `get_wcb_size` doesn't seem to be a thing. If you google it, this answer is the *only* result. – Mgamerz May 22 '22 at 16:40
0
auto Ascii_To_Wstring = [](int code)->std::wstring
{
    if (code>255 || code<0 )
    {
        throw std::runtime_error("Incorrect ASCII code");
    }
    std::string s{ char(code) };

    std::wstring w{ s.begin(),s.end() };
    return w;
};
abdulrhmanOmran
  • 111
  • 1
  • 7
0

Your problem has nothing to do with encodings, it's a simple matter of understanding basic C++. You are returning a pointer to a local variable from your function, which will have gone out of scope by the time anyone can use it, thus creating undefined behaviour (i.e. a programming error).

Follow this Golden Rule: "If you are using naked char pointers, you're Doing It Wrong. (Except for when you aren't.)"

I've previously posted some code to do the conversion and communicating the input and output in C++ std::string and std::wstring objects.

Community
  • 1
  • 1
Kerrek SB
  • 464,522
  • 92
  • 875
  • 1,084