
When I `#include <windows.h>` in C or C++, I am forced to decide the character format: `TCHAR` expands to either `char` or `wchar_t`.
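(Roughly speaking, as a simplified sketch of the mechanism rather than the literal SDK headers:)

```cpp
// Simplified sketch of the TCHAR mechanism; the real definitions
// live in <winnt.h>/<tchar.h> and are more involved.
#ifdef UNICODE
typedef wchar_t TCHAR;   // generic names map to the W APIs, e.g. MessageBoxW
#else
typedef char TCHAR;      // generic names map to the A APIs, e.g. MessageBoxA
#endif
```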

I've looked around quite a bit, and as far as posts such as this one or sites like this point out, the wchar_t approach came about a long time ago, before UTF-8, and for a variety of reasons isn't a particularly good Unicode solution in modern programming. However, these say nothing about support in existing systems that already run on wchar_t.

So my question is: which one should I use? If I use plain old char, will it be abandoned by MS in the future, given that, at the end of the day, the wchar_t version of the API is the more recent one? Or if I use wchar_t, will it be a pain to get my code running on other modern platforms, which developed later using plain old char with UTF-8?

c z
  • Unicode in Windows is `wchar_t`. Using the ANSI/`char` APIs is deprecated, and you should use the Unicode/`wchar_t` APIs for all new Windows applications. – MicroVirus Jun 02 '16 at 13:19
  • Windows is UTF-16, which for MS VC means `wchar_t`; further reading: https://msdn.microsoft.com/en-us/library/c426s321.aspx – Richard Critten Jun 02 '16 at 13:20
  • Also see: http://stackoverflow.com/questions/3298569/difference-between-mbcs-and-utf-8-on-windows, http://stackoverflow.com/questions/1311953/starting-a-new-windows-app-should-i-use-tchar-or-wchar-t-for-text, http://stackoverflow.com/questions/10202969/converting-ascii-strings-to-utf-16-before-passing-them-to-windows-api-functions, http://stackoverflow.com/questions/24044676/wchar-t-vs-char-for-creating-an-api, and http://stackoverflow.com/questions/166503/utf-8-in-windows – Cody Gray - on strike Jun 02 '16 at 13:35
  • http://stackoverflow.com/questions/14132260/what-are-the-disadvantages-to-not-using-unicode-in-windows is certainly related, but not a duplicate. I reviewed the [references](http://stackoverflow.com/questions/37592975/is-wchar-t-useful-for-the-windows-api-anymore#comment62671631_37592975) @Cody Gray commented with, which are also related. – chux - Reinstate Monica Jun 02 '16 at 14:22
  • The better question is: should I use `TCHAR` anymore, or stick exclusively to `WCHAR`? ANSI/`char` functions are deprecated, as mentioned. `WCHAR`/`wchar_t` is the standard now, regardless of UTF-16's merits or weaknesses. – jonspaceharper Jun 05 '16 at 09:01

1 Answer


It is definitely useful, and it is the only way to correctly handle arbitrary path names, since Windows path names are natively UTF-16 and may contain characters that cannot be represented in the ANSI code page. The choice of UTF-16 is often criticized (with good reason), but that's irrelevant: the OS uses it, so you have to use it, too. The best you can do is to always call the wide-character versions of the WINAPI functions (e.g. CreateFileW) and use UTF-8 internally in your program. Yes, that means converting back and forth, but that usually isn't a performance bottleneck.
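For example, a minimal sketch of that boundary conversion (error handling omitted; the helper names `Utf8ToWide`/`WideToUtf8` are made up for illustration, while `MultiByteToWideChar` and `WideCharToMultiByte` are the actual WINAPI calls):

```cpp
#include <windows.h>
#include <string>

// UTF-8 (internal) -> UTF-16 (for the *W API functions).
std::wstring Utf8ToWide(const std::string& s) {
    if (s.empty()) return {};
    int n = MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), nullptr, 0);
    std::wstring w(n, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), &w[0], n);
    return w;
}

// UTF-16 (from the OS) -> UTF-8 (internal).
std::string WideToUtf8(const std::wstring& w) {
    if (w.empty()) return {};
    int n = WideCharToMultiByte(CP_UTF8, 0, w.data(), (int)w.size(),
                                nullptr, 0, nullptr, nullptr);
    std::string s(n, '\0');
    WideCharToMultiByte(CP_UTF8, 0, w.data(), (int)w.size(),
                        &s[0], n, nullptr, nullptr);
    return s;
}

// Usage at the API boundary: the program keeps UTF-8, Windows gets UTF-16.
HANDLE OpenForReading(const std::string& utf8Path) {
    return CreateFileW(Utf8ToWide(utf8Path).c_str(), GENERIC_READ,
                       FILE_SHARE_READ, nullptr, OPEN_EXISTING,
                       FILE_ATTRIBUTE_NORMAL, nullptr);
}
```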

I strongly recommend the UTF-8 Manifesto, which explains why this is objectively the best way to go:

Portability, cross-platform interoperability and simplicity are more important than interoperability with existing platform APIs. So, the best approach is to use UTF-8 narrow strings everywhere and convert them back and forth when using platform APIs that don’t support UTF-8 and accept wide strings (e.g. Windows API). Performance is seldom an issue of any relevance when dealing with string-accepting system APIs (e.g. UI code and file system APIs), and there is a great advantage to using the same encoding everywhere else in the application, so we see no sufficient reason to do otherwise.

Tamás Szelei
  • Why not just use UTF-16 throughout your program? That quote from the UTF-8 manifesto would read the same if you replaced every instance of UTF-8 with UTF-16, so it's neither an argument for nor against; it's only an argument for a (any) standard. – MicroVirus Jun 02 '16 at 13:24
  • @MicroVirus: UTF-16 has the same drawbacks as UTF-8 (and UTF-32, for that matter), in that you can't simply iterate over code points or code units and expect to get a single character, glyph, or even valid data. The only advantage of UTF-16 over UTF-8 is the alleged memory savings for non-ASCII, non-Asian (I think) Unicode strings. But I bet that would be compensated by the memory savings UTF-8 gets you in the ASCII range. Also: Mac/Linux/BSD have long used UTF-8 natively in their underlying OS APIs. So does POSIX. – rubenvb Jun 02 '16 at 13:27
  • @MicroVirus Because UTF-8 is objectively better, and because non-MS operating systems and the web use this encoding. – Tamás Szelei Jun 02 '16 at 13:28
  • @MicroVirus Also, tons of other reasons that are not relevant to this question can be found in the linked manifesto. I highly recommend it; it's an outstandingly well-written piece. – Tamás Szelei Jun 02 '16 at 13:29
  • @rubenvb The advantage of UTF-16 over UTF-8 for a Windows program is clear: you can use UTF-16 throughout your program without having to convert between encodings at every OS call. For the rest, we seem to agree that UTF-8 and UTF-16 share the same drawbacks, so UTF-16 is then the better choice for Windows programs. – MicroVirus Jun 02 '16 at 13:30
  • Sure, all of the standards have drawbacks. The natural solution is to use whichever one your target platform implements natively. If you are truly writing cross-platform code, you need to pick one and standardize on that for your public API. But most people are not writing truly cross-platform code, and those concerns are not relevant in the back end where you're calling platform-specific APIs. The "UTF-8 Manifesto" is filled with unwarranted claims and a lot of assumptions. Any subsystem dealing with platform APIs should stick with that platform's standard. Write a translation layer if/when necessary. – Cody Gray - on strike Jun 02 '16 at 13:31
  • *"Because UTF-8 is objectively better and because non-MS operating systems use this encoding and the web uses this encoding."* - I don't see how you get from that to *"use UTF-8 in your program internally"*. Certainly, no Linux distribution would ever care what your Windows program does internally. On Windows, it is usually best to keep everything UTF-16 internally, and convert from/to UTF-8 as data enters/leaves the application (files, pipes, sockets, etc.). – IInspectable Jun 02 '16 at 13:35
  • @TamásSzelei I've browsed through the 'manifesto' and I honestly think it's garbage from a rhetorical point of view. It's many things, but 'objective' is not one of them. I actually even agree that UTF-8 is technically preferred over UTF-16 when not considering circumstances, but the choice has been made already, because Windows uses UTF-16. It doesn't warrant the effort of 'fighting' with Windows over this. – MicroVirus Jun 02 '16 at 13:42
  • For path names, UTF-8 is of course irrelevant. Your Linux computer isn't going to understand `L"C:\Program Files(x86)\Acme Inc\"`, regardless of the encoding used. The same holds for any string with a system-specific meaning. UTF-8 is used for _text_, typically in a natural language and with a meaning to humans. – MSalters Jun 02 '16 at 13:47
  • @MSalters: But it does understand (relative, forward slashes) filenames. Just like Mac/Windows. And Windows even understands `/` as the root of your drive. So the differences in filenames are limited to directory names. – rubenvb Jun 02 '16 at 14:15
  • @rubenvb: There are **lots** more differences (see [Naming Files, Paths, and Namespaces](https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247.aspx) for a list of some of them), some of the Windows file I/O functions will not translate `/` to `\\`, and user code can request no translation at all. If that weren't enough, NTFS doesn't even care about valid UTF-16 sequences. It will happily use anything you throw at it, as long as it's an even multiple of 16 bits. – IInspectable Jun 02 '16 at 14:30
  • @IInspectable To be fair, most filesystems on Linux will happily use more or less any null-terminated sequence of bytes you throw at it as a filename (treating the `/` character as a directory separator of course). – Ian Abbott Jun 02 '16 at 15:24
  • @IanAbbott: That's beside the point MSalters was trying to make: some data has inherent system-specific semantics (e.g. filenames). Even if you encode a Windows pathname using UTF-8, it will be meaningless once transferred to a Linux system. While the destination system can perfectly decode it, it wouldn't know what to make of the drive letter part, for example. Or the backslashes. Or the `\\?\` prefix. – IInspectable Jun 02 '16 at 15:56
  • @IInspectable I was really just commenting on your point about NTFS not caring about valid UTF-16 sequences, and how it is not uncommon for filesystems not to care about the charset encoding of filenames; I wasn't disagreeing with your main point about the portability of filenames. – Ian Abbott Jun 02 '16 at 16:13
  • If you are dealing with arbitrary pathnames, converting to UTF-8 and back is **dangerous**, for the reason IInspectable already pointed out. If you insist on doing so, you need to make sure you use a lossless conversion - one that will leave invalid UTF-16 sequences intact - and note that the Microsoft API Unicode functions are **not** lossless, probably because Microsoft wanted them to follow the standards. Most libraries are probably going to follow the standards too, so you'd probably have to roll your own converter to be safe (see the sketch after these comments). – Harry Johnston Jun 02 '16 at 22:05
  • This should be moved to chat. – jonspaceharper Jun 05 '16 at 09:02
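To illustrate the losslessness point above, here is a small sketch (mine, not from the thread) showing how the standard conversion handles invalid UTF-16: with default flags it silently substitutes U+FFFD (so a round-trip is lossy), while `WC_ERR_INVALID_CHARS` (Vista and later, for `CP_UTF8`) makes the call fail so the problem is at least detectable:

```cpp
#include <windows.h>
#include <string>

int main() {
    // An unpaired high surrogate: legal in an NTFS filename,
    // but invalid as UTF-16 text.
    std::wstring name = L"file_\xD800_.txt";

    // Default conversion: invalid sequences are replaced with U+FFFD,
    // so converting back to UTF-16 will NOT reproduce the original name.
    int lossy = WideCharToMultiByte(CP_UTF8, 0, name.data(), (int)name.size(),
                                    nullptr, 0, nullptr, nullptr);

    // Strict conversion: the call fails instead of substituting, and
    // GetLastError() returns ERROR_NO_UNICODE_TRANSLATION.
    int strict = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                                     name.data(), (int)name.size(),
                                     nullptr, 0, nullptr, nullptr);
    (void)lossy;
    (void)strict; // strict == 0 here, signalling the invalid sequence
}
```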