
I am having a little bit of trouble trying to implement a certain assignment based on which string argv[1] equals.

int _tmain(int argc, _TCHAR* argv[]) //wchar_t
{   
    if (argc != 2)
        exit(1);

    if (argv[1] == L"-foo")
        printf("Success!\n");

    wprintf(argv[1]);
    printf("\n");

    system("pause");
    return 0;
}

If I run the executable with the argument "-foo", I receive the following output:

-foo

It should be:

Success!
-foo

The string is exactly what I want it to be, but the if-condition remains false. Are wchar_t strings simply not comparable using the == operator? If so, how do I compare them properly?

ildjarn
Byzantian
  • 1. Strings cannot be compared that way. 2. See this: http://www.codeproject.com/Articles/76252/What-are-TCHAR-WCHAR-LPSTR-LPWSTR-LPCTSTR-etc – Ajay Jul 11 '12 at 18:01

2 Answers


Preliminary note: given the context of the question, "Unicode" and "Unicode character" in this answer refer to the UCS-2 (up to XP) and UTF-16 (starting with XP) encodings, and are used interchangeably with wide character, wchar_t, WCHAR and similar terms from the Win32 API. The Unicode standard offers multiple encodings, such as UTF-8, UTF-16 and UTF-32, that all encode the same set of characters; different versions of the standard have a different scope. Surrogate code points are used to escape from the Basic Multilingual Plane (BMP), roughly the first 64K code points, and thus encode more than fits into one 16-bit character per code point. The surrogate extensions were developed for the Unicode 2.0 standard, which was passed in the year NT 4.0 was released, but some years after the first "Unicode-capable" version of Windows, NT 3.51, was released. The original standard didn't account for more characters than the BMP, which is why "Unicode character" and "wide character" are even now used synonymously in the Win32 API context, although this is inaccurate.

To answer the underlying question you raised:

Are wchar_t strings simply not comparable using the "==" operator?

No, they aren't, and neither are "ANSI" strings, i.e. those based on the char type. Remember, a C string (both wchar_t and char based) is handled through a pointer to its first character. This means that with == you were comparing two pointer values, which were definitely not equal. One, after all, points to a string literal (i.e. inside your program image), while the other points to the argument storage that the runtime sets up at startup. So they are definitely two different entities.
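To make the difference concrete, here is a minimal sketch in portable C++. The helper names same_pointer and same_text are made up for this illustration and are not part of any library; only std::wcscmp is a real standard function.

```cpp
#include <cassert>
#include <cwchar>

// Hypothetical helper: compares addresses only.
// True only when both pointers refer to the very same storage.
bool same_pointer(const wchar_t* a, const wchar_t* b)
{
    return a == b;
}

// Hypothetical helper: compares contents character by character.
// std::wcscmp returns 0 when the two strings are equal.
bool same_text(const wchar_t* a, const wchar_t* b)
{
    assert(a != nullptr && b != nullptr);
    return std::wcscmp(a, b) == 0;
}
```

With a separate buffer such as wchar_t arg[] = L"-foo"; standing in for argv[1], same_pointer(arg, L"-foo") is false because the buffer and the literal live at different addresses, while same_text(arg, L"-foo") is true.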

If you wanted to use ==, you would have to use a language such as C++ with the standard library class std::string (or std::basic_string<_TCHAR>) or, on Windows, the ATL class CString (or rather CStringT). These classes are sometimes referred to as smart string classes and use the C++ facility of overloading operator==(). However, you should keep in mind that the semantics differ depending on the implementation, so not every smart string class will compare the string contents. Some might merely compare the equality of this (i.e. is it the same instance), while others may compare the string contents case-insensitively or case-sensitively at their discretion.
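As a small sketch of the C++ route, assuming the argument is a wide string: wrapping it in std::wstring makes == compare the contents, case-sensitively. The function name is_foo_switch is hypothetical and only serves the example.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: std::wstring's operator== compares the characters,
// not the addresses, so this is true exactly when arg spells "-foo".
bool is_foo_switch(const wchar_t* arg)
{
    assert(arg != nullptr);
    return std::wstring(arg) == L"-foo";
}
```

This copies the argument into a temporary string, which is harmless for command-line parsing but is a design choice you may not want in a hot path.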

To compare C strings you have the following functions available for your use-case:

  • For "ANSI" character (char) strings: strcmp, _stricmp (and the "counted" variants: strncmp, _strnicmp ... there are more)
  • For Unicode character (wchar_t) strings: wcscmp, _wcsicmp (and the "counted" variants: wcsncmp, _wcsnicmp ... there are more)
  • For the variable character "type" (TCHAR) strings: _tcscmp, _tcsicmp (and the "counted" variants: _tcsncmp, _tcsnicmp ... there are more)
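A rough sketch of the whole-string and "counted" comparisons, restricted to the functions that are part of the C and C++ standard libraries (strcmp, wcscmp, wcsncmp). The underscore-prefixed case-insensitive variants are Microsoft-specific and are left out here so the example stays portable. The wrapper names equals and starts_with are made up for illustration.

```cpp
#include <cassert>
#include <cstring>
#include <cwchar>

// Hypothetical wrapper for narrow ("ANSI") strings: 0 from strcmp means equal.
bool equals(const char* a, const char* b)
{
    assert(a != nullptr && b != nullptr);
    return std::strcmp(a, b) == 0;
}

// Hypothetical wrapper for wide strings: 0 from wcscmp means equal.
bool equals(const wchar_t* a, const wchar_t* b)
{
    assert(a != nullptr && b != nullptr);
    return std::wcscmp(a, b) == 0;
}

// Hypothetical wrapper around the "counted" variant: compares at most
// wcslen(prefix) characters, so it tests whether text begins with prefix.
bool starts_with(const wchar_t* text, const wchar_t* prefix)
{
    assert(text != nullptr && prefix != nullptr);
    return std::wcsncmp(text, prefix, std::wcslen(prefix)) == 0;
}
```

For example, starts_with(L"-foo=bar", L"-foo") is true, while equals(L"-foo=bar", L"-foo") is false.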

You can remember these prefixes:

  • str -> string
  • wcs -> wide character string
  • tcs -> T character string

Side note: with #include <tchar.h> and <windows.h>, the macros _T (from tchar.h) and TEXT (from the Windows headers) are equivalent and are used to declare a string literal that will be either "ANSI" or Unicode depending on the defines at build time (_UNICODE for tchar.h, UNICODE for the Windows headers). The same apparently holds for _TCHAR and TCHAR, where the latter appears to be favored in the Win32 API context.

So a Unicode build will expand _T("something") to L"something", while the "ANSI" build will expand it to "something".
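The mechanism can be imitated in a few lines of portable C++. This is only a sketch of the idea, not the contents of tchar.h: DEMO_UNICODE, demo_tchar, DEMO_T, demo_tcscmp and is_foo_switch_t are all hypothetical names invented for this example.

```cpp
#include <cassert>
#include <cstring>
#include <cwchar>

#ifdef DEMO_UNICODE                 // hypothetical switch, stands in for _UNICODE
typedef wchar_t demo_tchar;         // wide build: characters are wchar_t
#define DEMO_T(x) L##x              // pastes an L prefix onto the literal
#define demo_tcscmp std::wcscmp     // maps to the wide comparison
#else
typedef char demo_tchar;            // narrow build: characters are char
#define DEMO_T(x) x                 // leaves the literal unchanged
#define demo_tcscmp std::strcmp     // maps to the narrow comparison
#endif

// The same source line compiles against either character type.
bool is_foo_switch_t(const demo_tchar* arg)
{
    assert(arg != nullptr);
    return demo_tcscmp(arg, DEMO_T("-foo")) == 0;
}
```

Building with DEMO_UNICODE defined turns the comparison into a wcscmp call on L"-foo"; building without it turns it into a strcmp call on "-foo". This is the reason for preferring _tcscmp over wcscmp in code that is otherwise written against _TCHAR.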

As to TCHAR, consider reading through the arguments put forth in: Is TCHAR still relevant? (pointed out by rubenvb). There are valid points for and against using TCHAR/_TCHAR; you should make a decision and stick with it, i.e. be consistent.

0xC0000022L
  • `wchar_t` is not Unicode, despite what MS wants you to believe. – rubenvb Jul 11 '12 at 21:27
  • @rubenvb: I have no clue what you are even trying to say with that assertion or how it would justify a downvote, but whatever. "Unicode" and the term "Unicode character", despite what *others* want you to believe, in the Win32 API context refers to wide characters and there in particular to the varieties `wchar_t` (compiler-defined) and `WCHAR` (header-defined). The fact that the used encoding for Unicode here is UTF-16 and used to be UCS-2 is more than most people want to know or need to know and moreover it's not relevant to this answer. – 0xC0000022L Jul 12 '12 at 11:59
  • Unicode has several encodings and I know that very well. There is only one encoding that is endianness-neutral (UTF-8), and there is an older encoding (a subset of UTF-16) called UCS-2, which is what Microsoft originally used starting with Windows NT, using a 16-bit character type (limited to the BMP). Starting with XP they moved to UTF-16. Unlike UTF-16, UCS-2 didn't have code points that allowed you to encode more than the first 64K characters (BMP). UTF-16 and UTF-32, on the other hand, have code points just like UTF-8 to allow for more characters than the BMP can offer. – 0xC0000022L Jul 12 '12 at 12:01
  • The Unicode `wchar_t` was only a nitpick. `_TCHAR` and `std::base_string` are mistakes. And you're missing the recommendation to not use the silly `TCHAR` business. – rubenvb Jul 12 '12 at 15:22
  • @rubenvb: there is nothing silly about `TCHAR` when it is used consistently. I still have to maintain code myself that for several reasons cannot be moved to a newer version of VS (mostly because of the runtime libs) and gets compiled as ANSI. However, parts of that code will be reused later and therefore using `TCHAR` is a very prudent approach. *If* the OP had posted the question with `wmain` I would have answered accordingly. In the context of the question `TCHAR` is correct. Also, what are the mistakes you are referring to? You could at least explain the criticism, since it is not obvious. – 0xC0000022L Jul 12 '12 at 15:31
  • I see your firewall is blocking google. It's [`std::basic_string`](http://en.cppreference.com/w/cpp/string/basic_string) and `TCHAR` was what I thought it was, but apparently MS decided to also define `_TCHAR` elsewhere. – rubenvb Jul 12 '12 at 15:35
  • it should have been `basic_string`, not `base_string` my typo. No code-completion in the browser. Doh! :) – 0xC0000022L Jul 12 '12 at 15:44

Never mind, got it.

if (wcscmp(argv[1], L"-foo") == 0)
Hans Passant
Byzantian
  • Actually your solution was wrong, it should have been `_tcscmp()` instead of `wcscmp` since you are using `_TCHAR`. Corrected that. – 0xC0000022L Jul 11 '12 at 17:15
  • @Hans Passant: can you explain why you undid my change? I'm genuinely curious, because I think it is a very *bad habit* to have `_tmain` and everything else set up for conditional compilation as ANSI vs. Unicode and then throw in **Unicode or ANSI-specific code** the way the OP did. So please, I'm curious what the rationale is, given the context of the question (use of `_tmain` and `_TCHAR`). Thanks. – 0xC0000022L Jul 11 '12 at 17:33
  • @0xc0 - TCHAR is archaic, there is no running version of Windows left that isn't Unicode at its core. _tmain is produced by the project template. – Hans Passant Jul 11 '12 at 17:38
  • @HansPassant: I get that, but I hold up that unless the code gets changed to `wmain` for completeness, it's bad habit to mix character-specific functions with those that depend on compile-time defines. And just because the core is Unicode doesn't mean there isn't a lot of ANSI-only code still out there. – 0xC0000022L Jul 11 '12 at 17:43