
I'm new to Windows programming, and after reading the Petzold book I wonder: is it still good practice to use the TCHAR type and the _T() function to declare strings, or should I just use wchar_t and L"" strings in new code?

I will target only modern Windows (versions 10 and 11 as of this writing), and my code will be i18n from the start.

Fábio

12 Answers


The short answer: NO.

Like all the others already wrote, a lot of programmers still use TCHARs and the corresponding functions. In my humble opinion, the whole concept was a bad idea. UTF-16 string processing is very different from simple ASCII/MBCS string processing. If you use the same algorithms/functions with both of them (which is what the TCHAR idea is based on!), you get very poor performance on the UTF-16 version if you do anything more than simple string concatenation (like parsing, etc.). The main reason is surrogates.

Unless you really have to compile your application for a system which doesn't support Unicode, I see no reason to use this baggage from the past in a new application.

Gutblender
Sascha
  • Fun fact: UTF-16 was not always there on the NT platform. Surrogate code points were introduced with Unicode 2.0, in 1996, which was the same year NT 4 was released. Up until and including (IIRC) Windows 2000, all NT versions used UCS-2, effectively a subset of UTF-16 which assumed each character to be representable with one code unit (i.e. no surrogates). – 0xC0000022L Jul 12 '12 at 16:00
  • btw, while I agree that `TCHAR` shouldn't be used anymore, I disagree that it was a bad idea. I also think that *if* you choose to be explicit instead of using `TCHAR`, you should be explicit *everywhere*, i.e. not use functions with `TCHAR`/`_TCHAR` (such as `_tmain`) in their declaration either. Simply put: be consistent. +1, still. – 0xC0000022L Jul 12 '12 at 16:03
  • It **was a good idea** back when it was introduced, but it should be irrelevant in new code. – Adrian McCarthy Dec 04 '13 at 17:36
  • You misrepresent what `TCHAR`s were initially introduced for: to ease development of code for Win 9x and Windows NT based versions of Windows. At that time, Windows NT's UTF-16 implementation was UCS-2, and the algorithms for string parsing/manipulation were identical. There were no surrogates. And even with surrogates, algorithms for DBCS (the only supported MBCS encoding for Windows) and UTF-16 are the same: in either encoding, a code point consists of one or two code units. – IInspectable Nov 21 '15 at 16:15
  • Suppose I want to use FormatMessage() to convert a value from WSAGetLastError() to something printable. The documentation for WSAGetLastError() says it takes LPTSTR as the pointer to the buffer. I really don't have much choice but to use TCHAR, no? – Edward Falk Aug 04 '16 at 08:01
  • @EdwardFalk: `WSAGetLastError` doesn't take any arguments, so I'm assuming that you're referring to [FormatMessage](https://msdn.microsoft.com/en-us/library/windows/desktop/ms679351.aspx). As the documentation points out, there is a Unicode export, `FormatMessageW`, that takes an `LPWSTR`. No need to use the generic-text mappings. This is true for almost all Windows API calls that take string arguments. – IInspectable Dec 02 '16 at 13:18

I have to agree with Sascha. The underlying premise of TCHAR / _T() / etc. is that you can write an "ANSI"-based application and then magically give it Unicode support by defining a macro. But this is based on several bad assumptions:

That you actively build both MBCS and Unicode versions of your software

Otherwise, you will slip up and use ordinary char* strings in many places.

That you don't use non-ASCII backslash escapes in _T("...") literals

Unless your "ANSI" encoding happens to be ISO-8859-1, the resulting char* and wchar_t* literals won't represent the same characters.

That UTF-16 strings are used just like "ANSI" strings

They're not. Unicode introduces several concepts that don't exist in most legacy character encodings. Surrogates. Combining characters. Normalization. Conditional and language-sensitive casing rules.

And perhaps most importantly, the fact that UTF-16 is rarely saved on disk or sent over the Internet: UTF-8 tends to be preferred for external representation.

That your application doesn't use the Internet

(Now, this may be a valid assumption for your software, but...)

The web runs on UTF-8 and a plethora of rarer encodings. The TCHAR concept only recognizes two: "ANSI" (which can't be UTF-8) and "Unicode" (UTF-16). It may be useful for making your Windows API calls Unicode-aware, but it's damned useless for making your web and e-mail apps Unicode-aware.

That you use no non-Microsoft libraries

Nobody else uses TCHAR. Poco uses std::string and UTF-8. SQLite has UTF-8 and UTF-16 versions of its API, but no TCHAR. TCHAR isn't even in the standard library, so no std::tcout unless you want to define it yourself.

What I recommend instead of TCHAR

Forget that "ANSI" encodings exist, except for when you need to read a file that isn't valid UTF-8. Forget about TCHAR too. Always call the "W" version of Windows API functions, and #define UNICODE (plus _UNICODE for the CRT's generic-text mappings) to make sure you don't accidentally call an "A" function.

Always use UTF encodings for strings: UTF-8 for char strings and UTF-16 (on Windows) or UTF-32 (on Unix-like systems) for wchar_t strings. typedef UTF16 and UTF32 character types to avoid platform differences.

dan04
  • 2012 calling: there are still applications to be maintained without `#define _UNICODE` even now. End of transmission :) – 0xC0000022L Jul 12 '12 at 15:57
  • @0xC0000022L the question was about *new* code. When you maintain old code, you obviously have to work with the environment *that* code is written for. If you're maintaining a COBOL application, then it doesn't matter if COBOL is a good language or not; you're stuck with it. And if you're maintaining an application which relies on TCHAR, then it doesn't matter if that was a good decision or not; you're stuck with it. – jalf Oct 21 '12 at 09:01
  • Indeed, TCHAR is not useful unless in COBOL :) – Pavel Radzivilovsky Nov 01 '12 at 06:01
  • `_UNICODE` controls how the generic-text mappings are resolved in the CRT. If you don't want to call the ANSI version of a Windows API, you need to define `UNICODE`. – IInspectable Jul 06 '16 at 18:45

If you're wondering whether it's still used in practice, then yes - it is still used quite a bit. No one will look at your code funny if it uses TCHAR and _T(""). The project I'm working on now is converting from ANSI to Unicode - and we're going the portable (TCHAR) route.

However...

My vote would be to forget all the ANSI/Unicode portability macros (TCHAR, _T(""), all the _tXXXXXX calls, etc.) and just assume Unicode everywhere. I really don't see the point of being portable if you'll never need an ANSI version. I would use all the wide-character functions and types directly, and prepend all string literals with an L.

Aardvark
  • You might write some code you'll want to use somewhere else where you do need an ANSI version, or (as Nick said) Windows might move to DCHAR or whatever, so I still think it's a very good idea to go with TCHAR instead of WCHAR. – Chris Walton Mar 10 '10 at 23:34
  • I doubt that Windows will ever switch to UTF-32. – dan04 Oct 22 '12 at 13:50

I would still use the TCHAR syntax if I was doing a new project today. There's not much practical difference between using it and the WCHAR syntax, and I prefer code which is explicit in what the character type is. Since most API functions and helper objects take/use TCHAR types (e.g.: CString), it just makes sense to use it. Plus it gives you flexibility if you decide to use the code in an ASCII app at some point, or if Windows ever evolves to Unicode32, etc.

If you decide to go the WCHAR route, I would be explicit about it. That is, use CStringW instead of CString, and use the casting macros when converting to TCHAR (e.g. CW2CT).

That's my opinion, anyway.

Peter Mortensen
Nick
  • Indeed, that's what will still work when the character encoding is eventually changed ''again''. – Medinoc Sep 16 '14 at 14:21
  • You prefer code which is explicit in what the character type is, and thus use a type which is sometimes this and sometimes that? Very persuasive. – Deduplicator Jan 13 '15 at 04:20
  • **−1** for the inconsistency noted by @Deduplicator, and for the negative-payoff advice to use a macro that can be whatever (and will generally not be tested for more than one specific value). – Cheers and hth. - Alf Jul 06 '16 at 17:57
  • "I prefer code which is explicit in what the character type is" -- but TCHAR _isn't_ a type, it's a preprocessor symbol. Just use wchar_t and the W-suffix of Windows API functions. Explicit AND typesafe. – Scott Smith Aug 16 '22 at 14:26

I would like to suggest a different approach (neither of the two).

To summarize: use char* and std::string, assuming UTF-8 encoding, and perform the conversions to UTF-16 only when wrapping API functions.

More information on and justification for this approach in Windows programs can be found at http://www.utf8everywhere.org.

Pavel Radzivilovsky
  • @PavelRadzivilovsky, when implementing your suggestion in a VC++ application, would we set the VC++ character set to 'None' or 'Multibyte (MBCS)'? The reason I am asking is that I just installed Boost::Locale and the default character set was MBCS. FWIW, my pure ASCII application was set to 'None' and I have now set it to 'MBCS' (since I will be using Boost::Locale in it) and it works just fine. Please advise. – Caroline Beltran Sep 21 '14 at 23:04
  • As utf8everywhere recommends, I would set it to 'Use Unicode character set'. This adds extra safety, but is not required. Boost::Locale's author is a very smart guy; I am sure he did the right thing, though. – Pavel Radzivilovsky Sep 22 '14 at 14:52
  • The *UTF-8 Everywhere* mantra won't become the right solution just because it is repeated more often. UTF-8 is undoubtedly an attractive encoding for serialization (e.g. files, or network sockets), but on Windows it is frequently more appropriate to store character data using the native UTF-16 encoding internally, and convert at the application boundary. One reason is that UTF-16 is the only encoding that can be converted immediately to any other supported encoding. This is not the case with UTF-8. – IInspectable Dec 02 '16 at 13:52
  • "..UTF-16 is the only encoding, that can be converted immediately to any other supported encoding." What do you mean? What's the problem to convert UTF-8 encoding to anything else? – Pavel Radzivilovsky Dec 03 '16 at 10:30
  • @PavelRadzivilovsky: *"What's the problem to convert UTF-8 encoding to anything else?"* - That's not what I said. You can immediately convert UTF-8 to UTF-16 by calling `MultiByteToWideChar`. But you cannot convert from UTF-8 to anything else without first converting to UTF-16. – IInspectable Dec 05 '16 at 19:06
  • I do not understand. To anything else - like what? E.g. UCS-4? Why not? Seems very easy, all numeric algorithms. – Pavel Radzivilovsky Dec 09 '16 at 18:58
  • @PavelRadzivilovsky IInspectable's point is that neither the Windows API nor the CRT offer UTF-8-to-CP#### functions; they only provide UTF-16-to-CP#### functions. So if you need to convert UTF-8 to any non-Unicode encoding, you have to do it in two steps, converting first to UTF-16 and only then to the non-Unicode encoding of your interlocutor's choice. – Medinoc Mar 01 '23 at 09:22
  • @Medinoc No, you don't have to. Converting is a function that does not need side effects; you don't need the Windows API nor any other API to convert an encoding (except for the memory allocations). There is nothing that stops you from converting it directly. (Having said that, it is very rare that you need to do that; just use UTF-8 and convert it to UTF-16 for the Windows API when needed. Why bother with any other encoding, except for the rare cases where some format has a different encoding?) – 12431234123412341234123 Jul 04 '23 at 14:50

The Introduction to Windows Programming article on MSDN says:

New applications should always call the Unicode versions (of the API).

The TEXT and TCHAR macros are less useful today, because all applications should use Unicode.

I would stick to wchar_t and L"".

Steven
  • Steven, you are quoting a text written by someone who does not understand the meaning of the word 'Unicode'. It is one of those unfortunate documents from the time of UCS-2 confusion. – Pavel Radzivilovsky Nov 01 '12 at 06:03
  • @PavelRadzivilovsky: The document was written for a system where *Unicode* and *UTF-16LE* are commonly used interchangeably. While technically inaccurate, it is unambiguous nonetheless. This is also explicitly pointed out in the introduction of the same text: *"Windows represents Unicode characters using UTF-16 encoding [...]"*. – IInspectable Dec 02 '16 at 13:35

TCHAR/WCHAR might be enough for some legacy projects. But for new applications, I would say NO.

All this TCHAR/WCHAR stuff is there for historical reasons. TCHAR provides a seemingly neat way (a disguise) to switch between ANSI text encoding (MBCS) and Unicode text encoding (UTF-16). In the past, people did not have a clear picture of the number of characters in all the languages of the world. They assumed 2 bytes were enough to represent all characters, and thus built a fixed-length character encoding scheme using WCHAR. However, this has not been true since the release of Unicode 2.0 in 1996.

That is to say: no matter which of CHAR/WCHAR/TCHAR you use, the text-processing part of your program should be able to handle variable-length characters for internationalization.

So you actually need to do more than just choose one of CHAR/WCHAR/TCHAR when programming on Windows:

  1. If your application is small and does not involve text processing (i.e. it just passes text strings around as arguments), then stick with WCHAR, since it is easier that way to work with the WinAPI's Unicode support.
  2. Otherwise, I would suggest using UTF-8 as the internal encoding, storing texts in char strings or std::string, and converting them to UTF-16 when calling the WinAPI. UTF-8 is now the dominant encoding, and there are lots of handy libraries and tools for processing UTF-8 strings.

Check out this wonderful website for more in-depth reading: http://utf8everywhere.org/

LeOpArD
  • *"UTF-8 is now the dominant encoding"* - This only holds by leaving out the second part of the quote (*"for the World Wide Web"*). For desktop applications, the most used native character encoding is likely still UTF-16. Windows uses it, Mac OS X does too, and so do .NET's and Java's string types. That accounts for a **massive** amount of code out there. Don't get me wrong, there's nothing wrong with UTF-8 for serialization. But more often than not (especially on Windows), you'll find that using UTF-16 internally is more appropriate. – IInspectable Dec 07 '16 at 15:03

Yes, absolutely; at least for the _T macro. I'm not so sure about the wide-character stuff, though.

The reason is to better support WinCE or other non-standard Windows platforms. If you're 100% certain that your code will remain on NT, then you can probably just use regular C-string declarations. However, it's best to tend toward the more flexible approach, as it's much easier to #define that macro away on a non-Windows platform than to go through thousands of lines of code and add it everywhere in case you need to port some library to Windows Mobile.

Nik Reiman
  • WinCE uses 16-bit wchar_t strings just like Win32. We have a large base of code that runs on WinCE and Win32, and we never use TCHAR. – mhenry1384 Jun 21 '10 at 21:30

IMHO, if there are TCHARs in your code, you're working at the wrong level of abstraction.

Use whatever string type is most convenient for you when dealing with text processing - this will hopefully be something supporting Unicode, but that's up to you. Do conversions at OS API boundaries as necessary.

When dealing with file paths, whip up your own custom type instead of using strings. This will allow you OS-independent path separators, give you an easier interface to code against than manual string concatenation and splitting, and be a lot easier to adapt to different OSes (ANSI, UCS-2, UTF-8, whatever).

snemarch
  • Unicode has at least three current encodings (UTF-8, UTF-16, UTF-32) and one deprecated encoding (UCS-2, a subset of what is now UTF-16). Which one do you refer to? I like the rest of the suggestions though +1 – 0xC0000022L Jul 12 '12 at 15:55
  • And of course with C++17 and later, this recommendation is outdated, given that `std::filesystem` now exists. – Spencer Feb 28 '23 at 16:33

The only reasons I see to use anything other than the explicit WCHAR are portability and efficiency.

If you want to make your final executable as small as possible, use char.

If you don't care about RAM usage and want internationalization to be as easy as simple translation, use WCHAR.

If you want to make your code flexible, use TCHAR.

If you only plan on using Latin characters, you might as well use ASCII/MBCS strings so that your users don't need as much RAM.

For people who are "i18n from the start up", save yourself the source code space and simply use all of the Unicode functions.

Trololol

TCHAR has a new meaning: it can now be used to port from WCHAR back to CHAR.

https://learn.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page

Recent releases of Windows 10 have used the ANSI code page and -A APIs as a means to introduce UTF-8 support to apps. If the ANSI code page is configured for UTF-8, -A APIs operate in UTF-8.
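
Per the linked documentation, a process opts into the UTF-8 ANSI code page through its application manifest; the documented fragment looks like this (after which the -A functions and CP_ACP interpret char strings as UTF-8):

```xml
<?xml version="1.0" encoding="utf-8"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <application>
    <windowsSettings>
      <activeCodePage xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings">UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>
```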

OwnageIsMagic
  • `TCHAR` still has the same meaning. Nothing changed. No one uses it. If you wish to take advantage of the UTF-8 code page for ANSI versions of API calls, call the `-A` API explicitly. That said, there are many APIs for which there is no ANSI version (e.g. `CommandLineToArgvW`) or that have restrictions compared to the Unicode versions (e.g. `CreateFile`). And COM and the Windows Runtime use `BSTR`s and `HSTRING`s, respectively, which don't have a UTF-8 equivalent. – IInspectable Feb 28 '23 at 11:30
  • UTF-8 support for -A APIs? Awesome! This was pretty much the main obstacle to using UTF-8 on Windows, making TCHAR just as relevant now as it was back then (But keep in mind IInspectable's comment in Pavel Radzivilovsky's answer https://stackoverflow.com/a/8991631/1455631 ). – Medinoc Mar 01 '23 at 09:26

TCHAR is not relevant anymore, since now we have UNICODE. You should use UTF-16 wchar_t* strings instead.

The Windows API takes wchar_t* strings, and they are UTF-16.

thebluetropics
  • "UTF-16 `wchar_t`". No, that doesn't make sense. `wchar_t` isn't defined as UTF-16. That would be `char16_t`, available if `__cpp_unicode_characters` is defined. `wchar_t` is generally UTF-32 on Linux and similar systems. And if you wanted to be specific for just Windows, the UTF-16 type would be `WCHAR`. – MSalters Oct 18 '22 at 14:38
  • @MSalters `WCHAR` is `wchar_t`. And `wchar_t` is UTF-16 on Windows, as Windows allows only 2^16 possible values for a single `wchar_t`. That being said, this context is for Windows APIs. Sorry for omitting the context, but it is obvious that this question has the `windows` tag... – thebluetropics Oct 19 '22 at 09:01
  • You might be mixing Windows, C++ (the language) and VC++ (Microsoft's compiler) here. There typically is a mapping between Windows and specific compilers targeting Windows. Windows defines WCHAR as 16 bits UTF-16; compilers have to pick a type that matches. C++ allows `wchar_t` to be 16 bits, but formally it can't be UTF-16 (since C++ requires it to be fixed-width, and UTF-16 has surrogate pairs). That's why gcc defines it as 32 bits UTF-32, which is fixed-width. – MSalters Oct 20 '22 at 06:43