
What are TCHAR strings, such as LPTSTR and LPCTSTR, and how can I work with them? When I create a new project in Visual Studio it creates this code for me:

#include <tchar.h>

int _tmain(int argc, _TCHAR* argv[])
{
   return 0;
}

How can I, for instance, concatenate all the command line arguments?

If I wanted to open a file with the name given by the first command line argument, how would I do this? The Windows API defines 'A' and 'W' versions of many of its functions, such as CreateFile, CreateFileA and CreateFileW; how do these differ from one another, and which one should I use?

Jonathan Potter
MicroVirus
  • I find myself writing the gist of this Q/A set often enough in questions coming up. I want to start using this as my standard reference when such a need pops up. Any improvements are welcomed; feel free to edit or add your own answer(s). – MicroVirus Nov 20 '15 at 21:56
  • @AdrianMcCarthy Nice reference! I didn't come across that one in my search. – MicroVirus Nov 20 '15 at 23:21
  • I'm on the fence about calling it a duplicate. While it covers much of the same ground, this question will be easier to find by many people interested in this topic, and the answer is excellent. – Adrian McCarthy Nov 20 '15 at 23:28
  • @AdrianMcCarthy I'm okay with it being a duplicate. Just because it's a duplicate, doesn't mean that either is bad. Now the questions are linked, so people can read both and decide which they like. I do think my question is slightly different, but your answer is also excellent. I wrote mine to be as newbie friendly as I could manage it. Had I seen that Q&A earlier, I would've added my answer to that one, I think. – MicroVirus Nov 20 '15 at 23:30
  • Linking them is better. Readers get both – David Heffernan Nov 20 '15 at 23:50

1 Answer


Let me start off by saying that you should preferably not use TCHAR for new Windows projects and instead directly use Unicode. On to the actual answer:

Character Sets

The first thing we need to understand is how character sets work in Visual Studio. The project property page has an option to select the character set used:

  • Not Set
  • Use Unicode Character Set
  • Use Multi-Byte Character Set

(Screenshot: Project Property Page - Character Set)

Depending on which of the three options you choose, a lot of definitions change to accommodate the selected character set. Three main classes of definitions are affected: strings, string routines from tchar.h, and API functions:

  • 'Not Set' corresponds to TCHAR = char using ANSI encoding, where you use the standard 8-bit code page of the system for strings. All tchar.h string routines use the basic char versions. All API functions that work with strings will use the 'A' version of the API function.
  • 'Unicode' corresponds to TCHAR = wchar_t using UTF-16 encoding. All tchar.h string routines use the wchar_t versions. All API functions that work with strings will use the 'W' version of the API function.
  • 'Multi-Byte' corresponds to TCHAR = char, using some multi-byte encoding scheme. The tchar.h string routines generally use the multi-byte character set versions, with some exceptions such as _tcslen (see below). All API functions that work with strings will use the 'A' version of the API function.

Related reading: About the "Character set" option in visual studio 2010

TCHAR.h header

The tchar.h header lets you use generic names for the C string functions; each generic name switches to the correct function for the selected character set. For instance, _tcscat will switch to either strcat (not set), wcscat (unicode), or _mbscat (mbcs). _tcslen will switch to either strlen (not set), wcslen (unicode), or strlen (mbcs).

The switch happens by defining all _txxx symbols as macros that evaluate to the correct function, depending on the compiler switches.

The idea behind it is that you can use the encoding-agnostic types TCHAR (or _TCHAR) and the encoding-agnostic functions that work on them, from tchar.h, instead of the regular string functions from string.h.

Similarly, _tmain is defined to be either main or wmain. See also: What is the difference between _tmain() and main() in C++?

A helper macro _T(..) is defined for getting string literals of the correct type, either "regular literals" or L"wchar_t literals".
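
The switching mechanism can be illustrated with a small portable imitation. This is only a sketch, not the contents of tchar.h: DEMO_UNICODE, demo_tchar, DEMO_T, demo_tcscat and demo_tcslen are made-up stand-ins for _UNICODE, TCHAR, _T, _tcscat and _tcslen, and the real header is considerably more elaborate.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical switch standing in for _UNICODE.
// Compile with -DDEMO_UNICODE (or /DDEMO_UNICODE) to select the wide variants.
#ifdef DEMO_UNICODE
typedef wchar_t demo_tchar;             // stand-in for TCHAR
#define DEMO_T(x) L##x                  // stand-in for _T: wide literal
#define demo_tcscat std::wcscat         // stand-in for _tcscat
#define demo_tcslen std::wcslen         // stand-in for _tcslen
#else
typedef char demo_tchar;
#define DEMO_T(x) x                     // narrow literal, unchanged
#define demo_tcscat std::strcat
#define demo_tcslen std::strlen
#endif

// Written once against the generic names; compiles in either mode.
std::size_t greeting_length()
{
    demo_tchar buffer[32] = DEMO_T("Hello");
    demo_tcscat(buffer, DEMO_T(", world"));
    return demo_tcslen(buffer);         // length in demo_tchar units
}
```

The function body never mentions char or wchar_t, which is the whole point of the generic names: the preprocessor picks the concrete type and functions before the compiler sees the code.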

See the caveats mentioned here: Is TCHAR still relevant? -- dan04's answer

_tmain example

For the example of main in the question, the following code concatenates all the strings passed as command line arguments into one.

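#include <tchar.h>

int _tmain(int argc, _TCHAR *argv[])
{
   /* Fixed-size buffer: the copies below do not check for overflow */
   TCHAR szCommandLine[1024];

   if (argc < 2) return 0;

   /* Start with the first argument, then append the rest separated by spaces */
   _tcscpy(szCommandLine, argv[1]);
   for (int i = 2; i < argc; ++i)
   {
       _tcscat(szCommandLine, _T(" "));
       _tcscat(szCommandLine, argv[i]);
   }

   /* szCommandLine now contains the command line arguments */

   return 0;
}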

(Error checking is omitted.) This code works for all three character set options, because we consistently used TCHAR, the tchar.h string functions, and _T for string literals. Forgetting to surround your string literals with _T(..) is a common source of compiler errors when writing such TCHAR programs. Had we not done all of this, switching character sets would cause the code either to fail to compile or, worse, to compile but misbehave at run time.
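
If the fixed 1024-character buffer is a concern, one possible alternative is to let a string class manage the memory. The following is a minimal sketch, not part of tchar.h: join_args is a hypothetical helper, templated on the character type so that the same code serves char and wchar_t (and therefore whichever of the two TCHAR resolves to).

```cpp
#include <string>

// Hypothetical helper: joins argv[1] .. argv[argc - 1] with single spaces.
// CharT is deduced from argv, so it works for char and wchar_t alike.
template <typename CharT>
std::basic_string<CharT> join_args(int argc, CharT* argv[])
{
    std::basic_string<CharT> joined;
    for (int i = 1; i < argc; ++i)
    {
        if (i > 1)
            joined += CharT(' ');   // separator between arguments
        joined += argv[i];          // the string grows as needed
    }
    return joined;                  // empty when there are no arguments
}
```

Inside _tmain this could be called as join_args(argc, argv); no _T literal is needed because the only literal is a single character converted to CharT.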

Windows API functions

Windows API functions that work on strings, such as CreateFile and GetCurrentDirectory, are implemented in the Windows headers as macros that, like the tchar.h macros, switch to either the 'A' version or 'W' version. For instance, CreateFile is a macro that is defined to CreateFileA for ANSI and MBCS, and to CreateFileW for Unicode.

Whenever you use the flat form (without 'A' or 'W') in your code, the actual function called will switch depending on the selected character set. You can force the use of a particular version by using the explicit 'A' or 'W' names.

The conclusion is that you should always use the unqualified name, unless you want to always refer to a specific version, independently of the character set option.
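
The difference between the unqualified name and the explicit 'A' and 'W' names can be shown with a toy pair of functions. This is a portable sketch under assumed names: DemoOpenA, DemoOpenW, DemoOpen, DEMO_TEXT and DEMO_API_UNICODE are invented for illustration and are not Windows API names.

```cpp
// Hypothetical narrow and wide variants, standing in for an API pair
// such as CreateFileA / CreateFileW. Each reports which variant ran.
char DemoOpenA(const char*)    { return 'A'; }
char DemoOpenW(const wchar_t*) { return 'W'; }

// Hypothetical switch standing in for UNICODE.
#ifdef DEMO_API_UNICODE
#define DemoOpen DemoOpenW
#define DEMO_TEXT(x) L##x
#else
#define DemoOpen DemoOpenA
#define DEMO_TEXT(x) x
#endif

// Unqualified name: the variant depends on the compile-time switch.
char generic_call()       { return DemoOpen(DEMO_TEXT("file.txt")); }

// Explicit names: always the same variant, whatever the switch says.
char forced_narrow_call() { return DemoOpenA("file.txt"); }
char forced_wide_call()   { return DemoOpenW(L"file.txt"); }
```

Without DEMO_API_UNICODE defined, generic_call() resolves to the 'A' variant, while the two forced calls are unaffected by the switch.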

For the example in the question, where we want to open the file given by the first argument:

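#include <windows.h>
#include <tchar.h>

int _tmain(int argc, _TCHAR *argv[])
{
   if (argc < 2) return 1;

   /* CreateFile expands to CreateFileA or CreateFileW to match the type of argv[1] */
   HANDLE hFile = CreateFile(argv[1], GENERIC_READ, 0, NULL, OPEN_EXISTING, 0, NULL);

   /* Read from file and do other stuff */
   ...

   CloseHandle(hFile);

   return 0;
}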

(Error checking is omitted.) Note that nowhere in this example did we need any TCHAR-specific code, because the macro definitions have already taken care of this for us.

Utilising C++ strings

We've seen how we can use the tchar.h routines to use C style string operations to work with TCHARs, but it would be nice if we could leverage C++ strings to work with this.

My advice would foremost be to not use TCHAR and instead use Unicode directly, see the Conclusion section, but if you want to work with TCHAR you can do the following.

To use TCHAR, what we want is an instance of std::basic_string that uses TCHAR. You can do this by typedefing your own tstring:

typedef std::basic_string<TCHAR> tstring;

For string literals, don't forget to use _T.

You'll also need to use the correct versions of cin and cout. You can use references to implement a tcin and tcout:

#include <iostream>

#if defined(_UNICODE)
std::wistream &tcin = std::wcin;
std::wostream &tcout = std::wcout;
#else
std::istream &tcin = std::cin;
std::ostream &tcout = std::cout;
#endif

This should allow you to do almost anything. There might be the occasional exception, such as std::to_string and std::to_wstring, for which you can find a similar workaround.
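
For the std::to_string / std::to_wstring case, one possible workaround is a small template that is specialised per character type. This is a sketch only: number_to_string is a hypothetical name, and the choice of long long as the parameter type is an assumption made to keep the example short.

```cpp
#include <string>

// Hypothetical helper: converts a number to a string of the requested
// character type. Only the two specialisations below are defined.
template <typename CharT>
std::basic_string<CharT> number_to_string(long long value);

template <>
inline std::string number_to_string<char>(long long value)
{
    return std::to_string(value);
}

template <>
inline std::wstring number_to_string<wchar_t>(long long value)
{
    return std::to_wstring(value);
}
```

With the tstring typedef from above, number_to_string<TCHAR>(42) would then yield a tstring in either character set configuration.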

Conclusion

This answer (hopefully) details what TCHAR is and how it's used and intertwined with Visual Studio and the Windows headers. However, we should also wonder if we want to use it.

My advice is to directly use Unicode for all new Windows programs and don't use TCHAR at all!

Others giving the same advice: Is TCHAR still relevant?

To use Unicode after creating a new project, first ensure the character set is set to Unicode. Then, remove the #include <tchar.h> from your source file (or from stdafx.h). Fix up any TCHAR or _TCHAR to wchar_t and _tmain to wmain:

int wmain(int argc, wchar_t *argv[])

For non-console projects, the entry point for Windows applications is WinMain and will appear in TCHAR-jargon as

int APIENTRY _tWinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPTSTR lpCmdLine, int nCmdShow)

and should become

int APIENTRY wWinMain(HINSTANCE hInstance, HINSTANCE hPrevInstance, LPWSTR lpCmdLine, int nCmdShow)

After this, only use wchar_t strings and/or std::wstrings.

Further caveats

  • Be careful with sizeof(szMyString) on TCHAR arrays (strings). For ANSI it is the size both in bytes and in characters. For Unicode it is the size in bytes only, and the number of characters is at most half of that. For MBCS it is the size in bytes, and the number of characters may or may not be equal. Both Unicode and MBCS can use multiple TCHARs to encode a single character.
  • Mixing TCHAR stuff and fixed char or wchar_t is very annoying; you have to convert the strings from one to the other, using the correct code page! A simple copy will not work in the general case.
  • There is a slight difference between _UNICODE and UNICODE, relevant if you want to conditionally define your own functions. See Why both UNICODE and _UNICODE?
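
The sizeof caveat can be made concrete by separating the two quantities. The helpers below are a portable sketch with hypothetical names (element_count and byte_size); in Visual C++ the existing _countof macro serves the same purpose as element_count.

```cpp
#include <cstddef>

// Hypothetical helper: number of array elements (char or wchar_t units).
template <typename CharT, std::size_t N>
constexpr std::size_t element_count(const CharT (&)[N])
{
    return N;
}

// Hypothetical helper: size of the array in bytes, same value as sizeof(array).
template <typename CharT, std::size_t N>
constexpr std::size_t byte_size(const CharT (&)[N])
{
    return N * sizeof(CharT);
}
```

For a char array the two values coincide, while for a wchar_t array byte_size is larger by a factor of sizeof(wchar_t). Functions that expect a length in characters should be given element_count, not sizeof. Note that both values count storage units; neither tells you how many user-visible characters the string contains.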

A very good, complementary answer is: Difference between MBCS and UTF-8 on Windows

Community
MicroVirus
  • Totally disagree with Unicode as a default choice. It is so 20th century. Everybody should go for UTF-8. – SergeyA Nov 20 '15 at 22:09
  • @SergeyA Well, that'd be fine, except for that Windows does not support UTF-8. – MicroVirus Nov 20 '15 at 22:10
  • are you kidding me? I am reading this page in Windows, using UTF-8 as an encoding. – SergeyA Nov 20 '15 at 22:12
  • @SergeyA The whole windows *API* doesn't support UTF8. The browser is free to implement a conversion from UTF8 data from the net to UTF16 itself (and free to choose a GUI lib / render engine which works with UTF8 too). – deviantfan Nov 20 '15 at 22:14
  • @SergeyA And no, we're not kidding you. Windows doesn't support UTF8, and yes it's sad. – deviantfan Nov 20 '15 at 22:14
  • OMG. Every day something new. Sorry guys, had no idea. – SergeyA Nov 20 '15 at 22:15
  • @SergeyA Hence, the Unicode recommendation (it comes directly from Microsoft too): you can either use a code page, or if you want unicode support then you have to choose UTF-16 on the API level. – MicroVirus Nov 20 '15 at 22:15
  • One thing I would add is that sometimes TCHAR/tstrings can be useful for cross-platform development. If you want to support the native string encoding on all platforms, TCHAR is one method of supporting UTF-16 on Windows while also supporting UTF-8 on Mac and Linux. Not really what it was _intended_ to be used for, but I've seen it used this way out in the wild. – MrEricSir Nov 20 '15 at 22:33
  • Another bit that confuses people are the `UNICODE` and `_UNICODE` macros. `UNICODE` means the general Windows API names will use UTF-16. `_UNICODE` means the C and C++ language runtime libraries will use UTF-16 for the `tcs` APIs. – Adrian McCarthy Nov 20 '15 at 23:26
  • @AdrianMcCarthy I came across that yeah, I added a good SO link in the Caveats section. – MicroVirus Nov 20 '15 at 23:29
  • @SergeyA: You *"disagree with Unicode as a default choice"*, and then continue to suggest to use Unicode instead. This doesn't make sense. That's probably, because you don't understand the difference between character sets and character encoding. Mandatory reading: [The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)](http://www.joelonsoftware.com/articles/Unicode.html). – IInspectable Nov 21 '15 at 15:32
  • @deviantfan: Windows has **full** support for UTF-8 (see [Unicode](https://msdn.microsoft.com/en-us/library/windows/desktop/dd374081.aspx)), and it is used in several places. The .NET metadata compiled into assemblies is UTF-8 encoded, for example. The Windows **API**, on the other hand, opted for UTF-16 encoding, for obvious reasons. You cannot choose something (UTF-8) that doesn't exist. – IInspectable Nov 21 '15 at 15:36
  • Small correction: When *Character set* is set to *Not set*, the project defaults to **ASCII** encoding (not ANSI) (see [Generic-Text Mappings in Tchar.h](https://msdn.microsoft.com/en-us/library/c426s321.aspx)). – IInspectable Nov 21 '15 at 15:45
  • @IInspectable The first paragraph of [Text and Strings in Visual C++](https://msdn.microsoft.com/en-us/library/06b9yaeb.aspx) seems to imply that the SBCS (that is the Not Set) using the code page ('ANSI'). I have noted that MSDN is somewhat sloppy with such details, so I admit I am not 100% sure which of the two to believe. Most likely it would be that Not Set uses the full 8-bit code page; that's at least how it works with console applications in Visual Studio. – MicroVirus Nov 21 '15 at 16:10
  • @MicroVirus: Agreed, the MSDN does seem to contradict itself. I suppose the common denominator is, that *Not set* defaults to SBCS encoding (whatever that is). – IInspectable Nov 21 '15 at 16:19
  • @IInspectable That's what I wrote, or not? (that the API is limited, not the whole OS) – deviantfan Nov 21 '15 at 17:21
  • @deviantfan: *"Windows doesn't support UTF8"* is the opposite of saying *"Windows has full support for UTF-8"*. [You wrote the former](http://stackoverflow.com/questions/33836706/what-are-tchar-strings-and-the-a-or-w-version-of-win32-api-functions/33836707?noredirect=1#comment55439286_33836707). Your other quote (*"The whole windows API doesn't support UTF8"*) is equally wrong. [MultiByteToWideChar](https://msdn.microsoft.com/en-us/library/windows/desktop/dd319072.aspx) has UTF-8 support. This is part of the Windows API. – IInspectable Nov 21 '15 at 17:48
  • @IInspectable Windows really doesn't support UTF-8 in any reasonable way, see for instance [Setting UTF8 as default Character Encoding in Windows 7](http://superuser.com/questions/239810/setting-utf8-as-default-character-encoding-in-windows-7). And if you want to convert a SBCS or MBCS to UTF-8 you need to first convert it to UTF-16 and then convert to UTF-8; there's no direct way to convert to UTF-8. So Windows has two API functions that can encode as UTF-8, yes, but UTF-8 is not a valid code page for the 'A' functions. – MicroVirus Nov 23 '15 at 12:04
  • Windows doesn't use UTF-8 internally, but has full support to digest and produce UTF-8 encoded data. If you want to interact with the API directly you have to convert to its internal representation first. If the 'A' functions could be used with codepage 65001, then this conversion would happen just as well. Since you would change your mind and claim that Windows has full UTF-8 support if that were possible, I don't understand why you insist that it doesn't, if the conversion is visible in source code, and performed by means of the Windows API only. – IInspectable Nov 23 '15 at 12:27