6

How can I compare a wstring, such as L"Hello", to a string? If I need to have the same type, how can I convert them into the same type?

Mateen Ulhaq
aliakbarian

3 Answers

7

Since you asked, here are my standard conversion functions between narrow strings and wide strings, implemented using the C++ std::string and std::wstring classes.

First off, make sure to start your program with a call to std::setlocale:

#include <clocale>

int main()
{
  std::setlocale(LC_CTYPE, "");  // before any string operations
}

Now for the functions. First off, getting a wide string from a narrow string:

#include <string>
#include <vector>
#include <iostream>  // std::cout, used for the error messages below
#include <cassert>
#include <cstdlib>
#include <cwchar>
#include <cerrno>

// Dummy overload
std::wstring get_wstring(const std::wstring & s)
{
  return s;
}

// Real worker
std::wstring get_wstring(const std::string & s)
{
  const char * cs = s.c_str();
  const size_t wn = std::mbsrtowcs(NULL, &cs, 0, NULL);

  if (wn == size_t(-1))
  {
    std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
    return L"";
  }

  std::vector<wchar_t> buf(wn + 1);
  const size_t wn_again = std::mbsrtowcs(buf.data(), &cs, wn + 1, NULL);

  if (wn_again == size_t(-1))
  {
    std::cout << "Error in mbsrtowcs(): " << errno << std::endl;
    return L"";
  }

  assert(cs == NULL); // successful conversion

  return std::wstring(buf.data(), wn);
}

And going back, making a narrow string from a wide string. I call the narrow string "locale string", because it is in a platform-dependent encoding depending on the current locale:

// Dummy
std::string get_locale_string(const std::string & s)
{
  return s;
}

// Real worker
std::string get_locale_string(const std::wstring & s)
{
  const wchar_t * cs = s.c_str();
  const size_t wn = std::wcsrtombs(NULL, &cs, 0, NULL);

  if (wn == size_t(-1))
  {
    std::cout << "Error in wcsrtombs(): " << errno << std::endl;
    return "";
  }

  std::vector<char> buf(wn + 1);
  const size_t wn_again = std::wcsrtombs(buf.data(), &cs, wn + 1, NULL);

  if (wn_again == size_t(-1))
  {
    std::cout << "Error in wcsrtombs(): " << errno << std::endl;
    return "";
  }

  assert(cs == NULL); // successful conversion

  return std::string(buf.data(), wn);
}

Some notes:

  • If you don't have std::vector::data(), you can say &buf[0] instead.
  • I've found that the r-style (restartable) conversion functions mbsrtowcs and wcsrtombs don't work properly on Windows. There, you can use mbstowcs and wcstombs instead: mbstowcs(buf.data(), cs, wn + 1); and wcstombs(buf.data(), cs, wn + 1);


In response to your question, if you want to compare two strings, you can convert both of them to wide string and then compare those. If you are reading a file from disk which has a known encoding, you should use iconv() to convert the file from your known encoding to WCHAR and then compare with the wide string.
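As a minimal sketch of the "convert both to wide, then compare" idea: the helper names widen and same_text below are hypothetical, not part of the functions above, and the code assumes the narrow string is in the current locale's multibyte encoding.

```cpp
#include <cassert>
#include <clocale>
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical helper: widen a narrow string using the current C locale.
// Returns false if the bytes are not valid in that locale's encoding.
bool widen(const std::string & narrow, std::wstring & wide)
{
  std::mbstate_t state = std::mbstate_t();
  const char * src = narrow.c_str();

  // First call: null destination, so only the required length is computed.
  const std::size_t len = std::mbsrtowcs(NULL, &src, 0, &state);
  if (len == static_cast<std::size_t>(-1)) return false;

  // Second call: convert into a buffer with room for the terminator.
  std::vector<wchar_t> buf(len + 1);
  state = std::mbstate_t();
  std::mbsrtowcs(buf.data(), &src, buf.size(), &state);
  wide.assign(buf.data(), len);
  return true;
}

// Hypothetical helper: true if the narrow string, once widened, holds
// exactly the same wide characters as the wide string.
bool same_text(const std::string & narrow, const std::wstring & wide)
{
  std::wstring widened;
  return widen(narrow, widened) && widened == wide;
}
```

Call std::setlocale(LC_CTYPE, "") once before using it; after that, same_text("Hello", L"Hello") should hold in any locale, since plain ASCII converts the same way everywhere. This is still only an element-by-element comparison.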

Beware, though, that complex Unicode text may have multiple different representations as code point sequences which you may want to consider equal. If that is a possibility, you need to use a higher-level Unicode processing library (such as ICU) and normalize your strings to some common, comparable form.
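To make the caveat concrete, here is a small illustration that needs no Unicode library: "é" can be written as the single code point U+00E9 or as "e" followed by the combining acute accent U+0301. The names below are hypothetical and only serve the demonstration.

```cpp
#include <cassert>
#include <string>

// U+00E9 is the precomposed "é"; U+0065 U+0301 is "e" followed by a
// combining acute accent. Unicode treats the two as canonically equivalent.
const std::wstring precomposed = L"\u00E9";
const std::wstring decomposed  = L"e\u0301";

// Hypothetical helper: plain element-by-element comparison, which is all
// that operator== on std::wstring does.
bool same_code_units(const std::wstring & a, const std::wstring & b)
{
  return a == b;
}
```

Here same_code_units(precomposed, decomposed) is false even though both strings render as the same text. A normalization step (for example with ICU) would be needed before the two compare equal.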

Kerrek SB
  • I've updated the post to get rid of the variable-length arrays. Check it again with the most recent version. Sorry for that - I used the VLAs in my private code, but for public consumption it's better to use the vectors as I do here. I also added a few more required headers. – Kerrek SB Aug 23 '11 at 11:21
  • Thank you for your response. I converted my UTF-8 string to a wstring with your functions, but my comparison to the other wstring failed. Yesterday I found that my real problem was converting a UTF-8 std::string to a UTF-16 std::wstring, and that solved my problem: http://stackoverflow.com/questions/7153935/how-to-convert-utf-8-stdstring-to-utf-16-stdwstring – aliakbarian Aug 23 '11 at 11:38
  • The C++ `string` and `wstring` classes, as well as `mbstowcs`/`wcstombs`, are entirely encoding agnostic. You have no control over what encoding any of the strings end up in. If you need a definite encoding, you need to use something like `iconv()` to convert from WCHAR to the definite encoding. – Kerrek SB Aug 23 '11 at 11:41
  • can you tell me how can I use iconv()? – aliakbarian Aug 23 '11 at 11:50
  • I guess I could, but not in a comment -- can't you look up the documentation, or search the internet? There must be a bajillion examples out there. On Linux, type `man 3 iconv` for a synopsis. You have to use `iconv_open()`, `iconv()` and `iconv_close()` in that order. – Kerrek SB Aug 23 '11 at 12:02
  • @KerrekSB +1 I've modified this answer to use it over [here](http://stackoverflow.com/a/33084843/2642059). I wasn't sure, but if you think it's a duplicate I wanted you to know so you could close it. – Jonathan Mee Oct 12 '15 at 15:31
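For readers who want to see the iconv_open(), iconv(), iconv_close() sequence from the comments, here is a rough sketch for the UTF-8 to UTF-16 case. The function name utf8_to_utf16 is hypothetical, and the code assumes a POSIX iconv (such as the one in glibc) that accepts the encoding names "UTF-8" and "UTF-16LE".

```cpp
#include <cassert>
#include <cstddef>
#include <iconv.h>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: convert UTF-8 bytes to UTF-16 code units via iconv.
// Assumes a POSIX iconv that knows the names "UTF-8" and "UTF-16LE".
std::u16string utf8_to_utf16(const std::string & utf8)
{
  iconv_t cd = iconv_open("UTF-16LE", "UTF-8");  // arguments are (to, from)
  if (cd == (iconv_t)-1)
    throw std::runtime_error("iconv_open failed");

  // iconv() wants a non-const char**, so work on a copy of the input.
  std::vector<char> in(utf8.begin(), utf8.end());
  // Every UTF-8 byte yields at most one UTF-16 code unit (2 bytes).
  std::vector<char> out(2 * in.size() + 2);

  char * in_ptr = in.data();
  char * out_ptr = out.data();
  std::size_t in_left = in.size();
  std::size_t out_left = out.size();

  const std::size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
  iconv_close(cd);
  if (rc == static_cast<std::size_t>(-1))
    throw std::runtime_error("iconv failed (invalid or truncated input)");

  // Reassemble the little-endian byte pairs into 16-bit code units.
  std::u16string result;
  const std::size_t produced = out.size() - out_left;
  for (std::size_t i = 0; i + 1 < produced; i += 2)
  {
    const unsigned lo = static_cast<unsigned char>(out[i]);
    const unsigned hi = static_cast<unsigned char>(out[i + 1]);
    result.push_back(static_cast<char16_t>(lo | (hi << 8)));
  }
  return result;
}
```

The result is a std::u16string rather than a std::wstring, because wchar_t is only 16 bits wide on some platforms (Windows being the usual example). On such a platform the code units could be copied into a std::wstring unchanged.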
3

Think twice before doing this — you might not want to compare them in the first place. If you are sure you do and you are using Windows, then convert string to wstring with MultiByteToWideChar, then compare with CompareStringEx.

If you are not using Windows, then the analogous functions are mbstowcs and wcscmp. The standard wide-character functions are awkward to use portably on Windows; for instance, Visual C++ flags mbstowcs as deprecated in favor of mbstowcs_s.

The cross-platform way to work with Unicode is to use the ICU library.

Take care to use special functions for Unicode string comparison; don't do it manually. Two Unicode strings can consist of different characters and still represent the same text.

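#include <windows.h>
#include <string>
#include <vector>

std::wstring ConvertToUnicode(const std::string & str)
{
    UINT  codePage = CP_ACP;  // the system's ANSI code page
    DWORD flags    = 0;
    int   length   = static_cast<int>(str.length());
    // First call: only query how many wide characters are needed.
    int resultSize = MultiByteToWideChar
        ( codePage     // CodePage
        , flags        // dwFlags
        , str.c_str()  // lpMultiByteStr
        , length       // cbMultiByte
        , NULL         // lpWideCharStr
        , 0            // cchWideChar
        );
    // Zero-initialized, with one spare element so &result[0] is always valid.
    std::vector<wchar_t> result(resultSize + 1);
    // Second call: convert into the buffer.
    MultiByteToWideChar
        ( codePage     // CodePage
        , flags        // dwFlags
        , str.c_str()  // lpMultiByteStr
        , length       // cbMultiByte
        , &result[0]   // lpWideCharStr
        , resultSize   // cchWideChar
        );
    // Pass the length explicitly so embedded null characters survive.
    return std::wstring(&result[0], resultSize);
}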
Don Reba
  • 2
    The standard wide-character functions are not just "not portable under windows". They are non-portable period. The encodings used by these functions (both the "multi-byte" and the "wide-char" encodings) are totally implementation defined. – André Caron Aug 21 '11 at 22:20
3

You should convert the char string to a wchar_t string using mbstowcs, and then compare the resulting strings. Notice that mbstowcs works on char */wchar_t *, so you'll probably need to do something like this:

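#include <cstdlib>
#include <string>

std::wstring StringToWstring(const std::string & source)
{
    // Assumes the result never has more wchar_ts than source has chars.
    std::wstring target(source.size()+1, L' ');
    std::size_t newLength=std::mbstowcs(&target[0], source.c_str(), target.size());
    if (newLength == static_cast<std::size_t>(-1))
        return std::wstring(); // invalid multibyte sequence
    target.resize(newLength);
    return target;
}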

I'm not entirely sure that this use of &target[0] is standard-conforming; if someone has a good answer to that, please tell me in the comments. (Since C++11, std::basic_string is required to store its characters contiguously, which covers it for C++11 and later.) Also, there's an implicit assumption that the converted string won't be longer (in number of wchar_ts) than the number of chars in the original string. That is a reasonable assumption, but I'm not sure the standard guarantees it.

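On the other hand, the C standard does not seem to give mbstowcs a way to report the size of the needed buffer (POSIX lets you pass a null destination for that, and the restartable mbsrtowcs supports the same thing in standard C), so either you go this way, or you use the better-designed and better-defined code from Unicode libraries (be it the Windows APIs or libraries like iconv).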

Still, keep in mind that comparing Unicode strings without special functions is slippery ground: two equivalent strings may compare as different when compared bitwise.

Long story short: this should work, and I think it's the most you can do with just the standard library, but how Unicode is handled is heavily implementation-dependent, and I wouldn't trust it much. In general, it's better to stick with one encoding inside your application and avoid this kind of conversion unless absolutely necessary, and, if you are working with definite encodings, to use APIs that are less implementation-dependent.
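For comparison, a two-pass variant that measures first and then writes straight into the std::wstring might look like the sketch below. The name widen_two_pass is hypothetical. It swaps mbstowcs for the restartable std::mbsrtowcs, because standard C defines the null-destination "count only" call for that function, and it relies on the C++11 guarantee that std::wstring storage is contiguous.

```cpp
#include <cassert>
#include <clocale>
#include <cwchar>
#include <string>

// Hypothetical two-pass variant: measure first, then convert straight
// into the std::wstring's own storage (contiguous since C++11).
// Returns an empty string if the input is not valid in the current locale.
std::wstring widen_two_pass(const std::string & source)
{
    std::mbstate_t state = std::mbstate_t();
    const char * src = source.c_str();

    // Pass 1: a null destination makes mbsrtowcs only count.
    const std::size_t needed = std::mbsrtowcs(NULL, &src, 0, &state);
    if (needed == static_cast<std::size_t>(-1) || needed == 0)
        return std::wstring();

    // Pass 2: write exactly `needed` wide characters; the terminator is
    // not copied because the limit is reached first.
    std::wstring target(needed, L'\0');
    state = std::mbstate_t();
    src = source.c_str();
    std::mbsrtowcs(&target[0], &src, needed, &state);
    return target;
}
```

This drops the assumption that the wide string is never longer than the narrow one, at the cost of scanning the input twice. As with the other conversions here, std::setlocale(LC_CTYPE, "") should be called once beforehand.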

Matteo Italia
  • 3
    `wstring` cannot be constructed on a single number, and even if it could, you'd end up with a string of the wrong length. You really have to call `mbstowcs` twice to get the required target length. – Kerrek SB Aug 21 '11 at 21:33
  • 1
    Note that if the string holds non-English text or UTF-8, this might not work. In windows there's MBTWC, as @Don Reba suggests, but it's obviously not portable, if that matters to the OP. – Eran Aug 21 '11 at 21:33
  • 3
    @eran: `mbstowcs` is entirely encoding agnostic. It merely translates between "the system's multibyte representation" and "the system's wide character". There is no requirement for either of those to be anything you'd recognize. If you want a definite encoding, you have to follow this up with an `iconv()` conversion from WCHAR to your favourite (Unicode?) encoding. Here is a [little rant](http://stackoverflow.com/questions/6300804/wchars-encodings-standards-and-portability) of mine on the subject. – Kerrek SB Aug 21 '11 at 21:36
  • @Kerrek: I forgot the second parameter. The background assumption there was that the `wchar_t` string couldn't be longer than the `char` string for obvious reasons, but I don't think that's guaranteed, and also I should have cut the resulting string after knowing the result of `mbstowcs`. I'll rewrite the example. – Matteo Italia Aug 21 '11 at 21:40
  • 1
    @Kerrek, I'd upvote that rant of yours, but I can't. I already did that about a month and a half ago... – Eran Aug 21 '11 at 21:41
  • @Matteo: Yeah, I guess you can get away with the assumption that you never get more wchars than you have chars. I don't personally like it and it's not mandated by anything, but you probably can't break it either. :-) Eran: thanks! – Kerrek SB Aug 21 '11 at 21:42
  • @Kerrek SB: personally I think that someone should create a *standard conforming* C++ implementation that did the weirdest things allowed, including converting less `char`s to more `wchar_t`s, just to see how much stuff would break. :P – Matteo Italia Aug 21 '11 at 21:45
  • @Matteo: Yeah, I'm really surprised that there's no C++ version of `mbstowcs`. I could post my own standard version, but I think I already did that in some other question... – Kerrek SB Aug 21 '11 at 21:50
  • @Kerrek SB: can you post a simple program to show your suggestion? – aliakbarian Aug 21 '11 at 22:09
  • @Matteo: wide string literals are `L""` and wide chars are `L''`, not `''L` -- you're thinking of long doubles :-) See [this other rant of mine](http://stackoverflow.com/questions/6796157/unicode-encoding-for-string-literals-in-c0x) on literals. – Kerrek SB Aug 21 '11 at 22:18
  • @Kerrek: ugh, let's fix this too, this night I'm not writing a single correct thing `:(`. In my defense, I tend to forget about this stuff because on Windows I tend to use the `_T()` macro (for `TCHAR`s), and on Linux I just use UTF8 `char`s. – Matteo Italia Aug 21 '11 at 22:22
  • It might be better to just allocate a wchar* buffer and then construct an std::wstring from that instead of making the std::wstring and directly writing to its buffer. – RétroX Aug 21 '11 at 22:22
  • 1
    @RétroX: the idea of writing directly into the `std::wstring` is to avoid wasting time with another heap allocation/deallocation, which, if the conversion is done often, could impact the performance. The problem is that, although I saw that idiom more than once, I'm not sure if that thing is actually allowed by the standard. – Matteo Italia Aug 21 '11 at 22:24
  • @Kerrek: I'm with you with your rants, it seems that "vanilla C++" never managed to get Unicode right. – Matteo Italia Aug 21 '11 at 22:27
  • 1
    @Matteo: Hmm... actually, I think that the new C++ (and I guess C1x) get this right: It's the C++ philosophy that the *language facilitates library writing*. Everything should be done in a library if possible, and the language itself should only be modified when absolutely necessary (e.g. lambdas, rvalue references). Unicode is a perfect use case: It's an extremely complex, subtle subject concerning the nature of written text, and it's best left to a good, dedicated library. C++ provides the tools in the form of `char32_t` etc., but all higher-level stuff is up to the library... – Kerrek SB Aug 21 '11 at 22:33