20

I have a question:

Some libraries use WCHAR as the text parameter type and others use CHAR (as UTF-8). I need to know when to use WCHAR and when to use CHAR in my own library.

Mr.C64
user2179256
  • WCHAR stands for wide char, usually used when dealing with the UNICODE encoding style of text AFAIK – sumitb.mdi Apr 17 '14 at 15:11
  • There is no `WCHAR` in C++. Do you mean the `WCHAR` macro defined by the Windows headers? – David Heffernan Apr 17 '14 at 15:25
  • @DavidHeffernan: I assumed he meant WCHAR of Win32 headers (in fact, I was thinking of editing the OP's tags adding [winapi] :) – Mr.C64 Apr 17 '14 at 15:27
  • WCHAR isn't always Unicode - it could be DBCS, especially when dealing with character sets like Shift-JIS and BIG5. – cup Apr 17 '14 at 15:55

5 Answers

26

Use char and treat it as UTF-8. There are a great many reasons for this; this website summarises it much better than I can:

http://utf8everywhere.org/

It recommends converting from wchar_t to char (UTF-16 to UTF-8) as soon as you receive it from any library, and converting back when you need to pass strings to it. So to answer your question, always use char except at the point that an API requires you to pass or receive wchar_t.
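
For instance, on Windows that boundary conversion might look roughly like this (a minimal sketch: the `widen` helper and `show_message` function are my own names, not from utf8everywhere.org, and error handling is omitted for brevity):

```cpp
#include <string>
#include <windows.h>

// Hypothetical helper: convert a UTF-8 std::string to a UTF-16 std::wstring
// just before crossing into a Win32 "W" API.
std::wstring widen(const std::string& utf8)
{
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                  utf8.data(), static_cast<int>(utf8.size()),
                                  nullptr, 0);
    std::wstring utf16(len, L'\0');
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), static_cast<int>(utf8.size()),
                        &utf16[0], len);
    return utf16;
}

// The library's public interface stays char/UTF-8; wchar_t appears only here.
void show_message(const std::string& utf8_text)
{
    MessageBoxW(nullptr, widen(utf8_text).c_str(), L"Message", MB_OK);
}
```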

Ben Hymers
  • Actually, it says to use UTF-8 "if an application is not supposed to specialize in text". I tend to use UTF-8 pretty much everywhere, but I'm not sure it would be appropriate in an editor, for example. Things like regular expressions will be significantly slower if you use UTF-8, for example. – James Kanze Apr 17 '14 at 16:43
  • Good point; you're right. It sounds like this won't make a difference to the OP but it's worth mentioning. – Ben Hymers Apr 18 '14 at 18:17
  • @BenHymers Are you trying to say that UTF-8 takes 8 bits to encode any Unicode codepoint? For example, for the code point range `U+10000-U+10FFFF`, UTF-8 takes 4 bytes to encode a codepoint. I don't understand what you mean when you say take `char` and treat it as UTF-8. How can `char` store a UTF-8 encoding? – overexchange Nov 12 '16 at 21:01
  • @overexchange No, I'm definitely not saying that, I don't think anyone would accept such an answer :) We're talking about strings rather than single characters - so an array of `char` rather than a single `char` - I just omitted 'array of' for brevity as it's implied in this context. Each `char` is one byte of a UTF-8 symbol which may be one byte or more. – Ben Hymers Nov 14 '16 at 09:24
  • @overexchange Are you asking what an array of `char` looks like? It can be as simple as `char*`. I don't think that's what you really mean though since anyone quoting codepoint ranges must surely know what an array is ;) Can you be more specific? – Ben Hymers Nov 17 '16 at 22:28
  • @BenHymers you mean `char*` pointing to string? for each UTF-8 code character? – overexchange Nov 17 '16 at 22:30
  • @overexchange That's one representation of a sequence of bytes in C++, yes - there are others (e.g. `std::string`). The `char*` doesn't "point to a string", it points to the first character of a string, and the next memory address is the next character and so on, usually (by convention) terminated by a null (0) character. If this string is UTF-8, each of these `char`s is one byte which may be a whole Unicode char, or part of one (see the byte-level sketch after this comment thread). Honestly though, without intending to cause offence, you need to go back to the very basics of C++ before trying to understand how to represent Unicode in it! – Ben Hymers Nov 17 '16 at 22:41
  • @BenHymers If we can simply use `char*`, why was `wchar` introduced, when `wchar` is not portable? – overexchange Nov 17 '16 at 22:49
  • @overexchange You'll have to ask whoever added it to the standard! I believe it's a mistake, and so do many others: http://utf8everywhere.org/ - the reason is probably that somebody once thought "two bytes per character should be plenty". – Ben Hymers Nov 17 '16 at 22:59
  • @BenHymers So I should ignore such [documentation](https://www.gnu.org/software/libc/manual/html_node/Character-Set-Handling.html#Character-Set-Handling) provided by GNU C. – overexchange Nov 17 '16 at 23:05
  • @overexchange What are you doing? Are you trolling me? You're asking extremely basic questions then referencing detailed documents. I've been very patient so far but now it looks like you're just passive-aggressively trying to make yourself look smarter than me, not genuinely asking for help. The GNU documentation you provide is talking about GNU C specifically, where `wchar_t` is 32 bits - this isn't standard and can't be relied on on other platforms and compilers. I'm not going to reply any more - apparently I'm wasting my time since you're knee-deep in documentation already. Good luck. – Ben Hymers Nov 21 '16 at 10:46
  • @BenHymers No, I was just seeking help in understanding better. I did not expect (assume) that you would get bothered so badly. I apologize for this. – overexchange Nov 21 '16 at 17:44
  • The advice in this thread is just WRONG... because you can't generalize a practice for all use cases. The real advice is: it depends... I wrote a string class that, internally, uses wchar_t so I can convert to Unicode if needed. The public API accepts char* and converts to wchar_t (internally). And while this class works for most of my work, I don't think I would use it on an embedded platform where space and speed are typically primary objectives. For desktop and even mobile apps my wchar_t-based library is OK (again, it depends), but for code running on a small M4 ARM, probably not. – Eric Feb 12 '21 at 15:21
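
To make the byte-level point from the comments above concrete, here is a small sketch (illustrative only) of how a plain `char` array holds UTF-8: each array element is one byte, and a single code point may occupy one or several of those bytes.

```cpp
#include <cstdio>
#include <cstring>

int main()
{
    // "aé€" encoded as UTF-8: 'a' = 1 byte, 'é' = 0xC3 0xA9, '€' = 0xE2 0x82 0xAC.
    const char* s = "a\xC3\xA9\xE2\x82\xAC";
    std::printf("%zu bytes for 3 code points\n", std::strlen(s));  // prints 6
    return 0;
}
```
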
11

WCHAR (or wchar_t on Visual C++ compiler) is used for Unicode UTF-16 strings.
This is the "native" string encoding used by Win32 APIs.

CHAR (or char) can be used for several other string formats: ANSI, MBCS, UTF-8.

Since UTF-16 is the native encoding of Win32 APIs, you may want to use WCHAR (or better, a proper string class based on it, like std::wstring) at the Win32 API boundary, inside your app.

And you can use UTF-8 (so, CHAR/char and std::string) to exchange your Unicode text outside your application boundary. For example: UTF-8 is widely used on the Internet, and when you exchange UTF-8 text between different platforms you don't have the problem of endianness (whereas with UTF-16 you have to consider both the big-endian UTF-16BE and the little-endian UTF-16LE cases).

You can convert between UTF-16 and UTF-8 using the WideCharToMultiByte() and MultiByteToWideChar() Win32 APIs. These are pure-C APIs, and they can be conveniently wrapped in C++ code, using string classes instead of raw character pointers, and exceptions instead of raw error codes. You can find an example of that here.
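
For example, a wrapper in the UTF-16 → UTF-8 direction could be sketched like this (the `Utf16ToUtf8` name and the use of `std::runtime_error` are illustrative choices, not part of any particular library):

```cpp
#include <stdexcept>
#include <string>
#include <windows.h>

// Illustrative wrapper: UTF-16 std::wstring -> UTF-8 std::string,
// reporting failure with an exception instead of a raw error code.
std::string Utf16ToUtf8(const std::wstring& utf16)
{
    if (utf16.empty()) return std::string();
    int len = WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                                  utf16.data(), static_cast<int>(utf16.size()),
                                  nullptr, 0, nullptr, nullptr);
    if (len == 0)
        throw std::runtime_error("WideCharToMultiByte failed");
    std::string utf8(len, '\0');
    WideCharToMultiByte(CP_UTF8, WC_ERR_INVALID_CHARS,
                        utf16.data(), static_cast<int>(utf16.size()),
                        &utf8[0], len, nullptr, nullptr);
    return utf8;
}
```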

Mr.C64
  • @Mgetz: I know that. In fact, I assumed the OP meant WCHAR as defined in Win32 SDK headers, and his question was about the Win32 environment. Note that I wrote: _"`WCHAR` (or `wchar_t` **on Visual C++ compiler**)"_. – Mr.C64 Apr 17 '14 at 15:29
  • In fact, wchar_t does not store any encoding information at all. It is a type that is always wider (more bytes per word) than a char. – Bruno Ferreira Apr 17 '14 at 15:45
  • @Mr.C64 That seems to be a common assumption, one that I wouldn't make as the OP has not specified a compiler. – Mgetz Apr 17 '14 at 15:52
  • Are you sure it's UTF-16 in Windows and not UCS-2? – AlexDan Apr 17 '14 at 15:53
  • @BrunoFerreira Actually `wchar_t` is not necessarily any wider than `char`. The only requirement is that `wchar_t` be large enough to store a unique value for every member of the largest character set supported by an implementation. So if an implementation's largest character set is smaller than 256 then `wchar_t` can be 8 bits. – bames53 Apr 17 '14 at 16:28
  • @AlexDan Yes, Windows APIs use `wchar_t` to store UTF-16, including characters outside the BMP. It's somewhat ambiguous whether this behavior conforms to the C++ specification. – bames53 Apr 17 '14 at 16:30
  • @AlexDan: If you go back far enough (NT4 I think) then it was UCS-2, but it's been UTF-16 ever since. – RichieHindle Apr 17 '14 at 16:44
  • Great mention of `MultiByteToWideChar()` and `WideCharToMultiByte()`! Very helpful ^_^ – kayleeFrye_onDeck Aug 11 '17 at 04:02
4

The right question is not which type to use, but what should be your contract with your library users. Both char and wchar_t can mean more than one thing.

The right answer, to me, is to use char and consider everything UTF-8 encoded, as utf8everywhere.org suggests. This will also make it easier to write cross-platform libraries.

Make sure you make correct use of strings though. Some APIs, like fopen(), accept a char* string and treat it differently (not as UTF-8) when compiled on Windows. If Unicode is important to you (and it probably is, when you are dealing with strings), be sure to handle your strings correctly. A good example can be seen in boost::locale. I also recommend using boost::nowide on Windows to get strings handled correctly inside your library.
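
As a sketch of that approach (assuming Boost.Nowide is available; the `file_exists` helper is just for illustration), the library keeps a UTF-8 `char*` path and lets `boost::nowide::fopen` do the UTF-16 conversion on Windows:

```cpp
#include <boost/nowide/cstdio.hpp>
#include <cstdio>

// The path is UTF-8 on every platform; on Windows, boost::nowide::fopen
// converts it to UTF-16 internally so non-ASCII file names still open.
bool file_exists(const char* utf8_path)
{
    std::FILE* f = boost::nowide::fopen(utf8_path, "rb");
    if (!f) return false;
    std::fclose(f);
    return true;
}
```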

Pavel Radzivilovsky
2

In Windows we stick to WCHARs and std::wstring, mainly because if you don't, you end up having to convert whenever you call Windows functions.

I have a feeling that trying to use UTF-8 internally simply because of http://utf8everywhere.org/ is gonna bite us in the bum later on down the line.
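
For what it's worth, a sketch of that style (my own example, not from this answer): with std::wstring held internally, calls into the wide ("W") Win32 functions need no conversion at all.

```cpp
#include <string>
#include <windows.h>

// Keeping UTF-16 std::wstring internally means the Win32 call site is direct.
HANDLE open_for_read(const std::wstring& path)
{
    return CreateFileW(path.c_str(), GENERIC_READ, FILE_SHARE_READ,
                       nullptr, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, nullptr);
}
```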

Epirocks
0

It is recommended that, when developing a Windows application, you resort to TCHARs. The good thing about TCHARs is that they can be either regular chars or wchars, depending on whether the Unicode setting is set or not. Once you resort to TCHARs, make sure that the string-manipulation functions you use also carry the _t prefix (e.g. _tcslen for the length of a string). That way you will know that your code will work in both Unicode and ASCII environments.
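
For illustration only (and note that the comments below argue against TCHAR for new code), the style described here looks roughly like this:

```cpp
#include <windows.h>
#include <tchar.h>

int main()
{
    // With UNICODE/_UNICODE defined, TCHAR is wchar_t and _T("...") is a wide
    // literal; without them, TCHAR is char. _tcslen maps to wcslen or strlen.
    const TCHAR* greeting = _T("Hello");
    size_t length = _tcslen(greeting);
    _tprintf(_T("%u characters\n"), static_cast<unsigned>(length));
    return 0;
}
```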

santahopar
  • `TCHAR` and the ability to switch between `char` and `wchar_t` were useful for migrating legacy programs from legacy encoded `char` to `wchar_t`. `TCHAR` should not be used for any other purpose. New software should not be written with `TCHAR`: new Windows code should explicitly use either (UTF-8 encoded) `char`, or `wchar_t`. – bames53 Apr 17 '14 at 16:37
  • The really bad thing about `TCHAR` is that it can be either `char` or `wchar_t`, since you have to write distinctively different code depending on which one you use. Whatever you choose (and frankly, unless you are doing text processing, it should be `char`), use it, and not `TCHAR`. – James Kanze Apr 17 '14 at 17:08
  • @armanali Which encoding format? You have to handle whatever encoding format you receive. If it's UTF-8, then you write code which handles UTF-8; if it's UTF-16 (BE or LE), then you write code which handles UTF-16; if it's UTF-32 (BE or LE), then you write code which handles UTF-32. – James Kanze Apr 18 '14 at 20:40
  • I would say it is very bad to use setting-sensitive types (such as TCHAR, which depends on the UNICODE define) in a library. The question is about a library. Agree with utf8everywhere.org. – Pavel Radzivilovsky Jun 01 '15 at 15:24