Should a library's string encoding conform to Unicode or be flexible?

Question

I have created a library in C++ which exposes C-style interface APIs. Some of the arguments are strings, so they would be char *. Now I know they should all be Unicode, but because it is a library I don't think I should force that decision on users. Ideally I thought it would be best to use TCHAR so I can build it either way, for Unicode and ASCII users. Then I read this and it opposes the idea in general.

As an example of the API, the strings are filenames or error messages, as below.

void LoadSomeFile(char * fileName );
const char * GetErrorMsg();

I am using C++ and the STL. There is also the debate of std::string vs std::wstring. Personally I really like MFC's CString class, which takes care of all this nicely, but that means I would have to use MFC just for its string class.

Now I think TCHAR is probably the best solution for me but do I have to use CString (internally) for that to work? Can I use it with STL string? As far as I can see, it is either string or wstring there.

score 1 · Answer 1 · answered Mar 27 '14 at 19:37

The TCHAR type is an unfortunate design choice that has thankfully been left behind. Nobody has to use TCHAR any more, thank goodness. The Unicode choice has been made for us as well: Unicode is the only sane choice going forward.

The question is, is your library Windows-only? Or is it portable?

If your library is portable, then the typical choice is char * or std::string with UTF-8 encoded strings. For more information, see UTF-8 Everywhere. The summary is that wchar_t is UTF-16 on Windows but UTF-32 everywhere else, which makes it almost useless for cross-platform programming.
If your library runs on Win32 only, then you may feel free to use wchar_t instead. On Windows, wchar_t is UTF-16.

Don't use both; it will make your code and API bloated and difficult to read. TCHAR is a hack for supporting the Win32 API and migrating to Unicode.
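A minimal sketch of the UTF-8 approach, assuming the library accepts UTF-8 in `char *` at its API boundary and converts to UTF-16 only where it has to call wide Win32 functions. `Utf8ToUtf16` is a hypothetical helper name, not part of any library, and it assumes well-formed input. On Windows the same conversion is more commonly done with `MultiByteToWideChar` and `CP_UTF8`; the hand-written version here is only so the example is portable.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decode UTF-8 and re-encode it as UTF-16.
// Assumes well-formed UTF-8; production code must reject malformed input.
std::u16string Utf8ToUtf16(const std::string& utf8) {
    std::u16string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte tells us how many bytes the sequence occupies.
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        if (i + len > utf8.size()) break;  // truncated sequence: stop
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

With something like this, `LoadSomeFile` could keep its `char *` signature, document that the argument is UTF-8, and convert internally before opening the file.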

Dietrich Epp


But if I use `char *` my library will no longer be compatible with unicode applications and vice versa. This will only be used in windows so I don't have to worry about cross compatibility but I do want to make it work with both unicode and non-unicode applications. – zar Mar 27 '14 at 19:43
@zadane: Tell your users to pass Unicode strings in by converting them to UTF-8. Lots of libraries work this way already, I've used several. – Dietrich Epp Mar 27 '14 at 19:45
@zadane: I think you may be under the misconception that `char *` is not Unicode. This is inaccurate: `char *` is really just a data type, and "Unicode" is a way of representing characters. You can store a Unicode string in `char *` using UTF-8, in `short *` using UTF-16, or in `int *` using UTF-32. The `wchar_t` type is basically either `short` or `int` (or unsigned versions thereof) depending on whether you're on Windows or somewhere else, but `char *` is available *and consistent* on all common platforms. (This comment assumes typical widths for integral types.) – Dietrich Epp Mar 27 '14 at 19:49
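To illustrate the point that the character type and the encoding are separate things, here is a small sketch that stores the same two code points (U+00E9 and U+20AC) in three string types. The constant names are made up for this example, and the UTF-8 bytes are spelled out as hex escapes so the result does not depend on the source file's encoding.

```cpp
#include <string>

// The same text, "é€" (U+00E9 followed by U+20AC), in three Unicode encodings.
// Constant names are hypothetical, chosen only for this illustration.
const std::string    kUtf8  = "\xC3\xA9\xE2\x82\xAC";  // UTF-8: 2 bytes + 3 bytes
const std::u16string kUtf16 = u"\u00E9\u20AC";         // UTF-16: two 16-bit units
const std::u32string kUtf32 = U"\u00E9\u20AC";         // UTF-32: two 32-bit units
```

All three hold Unicode text; only the size of the code unit and the number of units per character differ.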
@Dietrich So are you saying I can just pass `char *` and still be Unicode? In theory that should be fine, but I want a string that I can easily manipulate as well. The TCHAR does translate to `char *` or `wchar_t` accordingly based on project/encoding settings. – zar Mar 27 '14 at 20:17
Yes, you can pass `char *` and still be Unicode. It is much easier to manipulate than `TCHAR`. With `TCHAR`, you don't know what encoding you are using and you don't know if it's Unicode at all--very hard to work with, very frustrating. With `wchar_t`, you don't know what encoding it is but you know it's Unicode -- still frustrating. If you put UTF-8 in a `char *`, which is incredibly common in C and C++, then you know exactly what encoding you are using--so it is very easy to manipulate strings. (Note the standard doesn't actually require `wchar_t` to be Unicode, but it is in practice.) – Dietrich Epp Mar 27 '14 at 21:17
Keep in mind that `wchar_t` is a variable-length encoding on Windows, so you can't slice strings at arbitrary points. `TCHAR` could be something monstrous like Shift JIS, which is also variable length. At least with UTF-8, you know what you are dealing with, and it's compatible with functions like `strlen()`, `snprintf()`, etc. – Dietrich Epp Mar 27 '14 at 21:20
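A rough sketch of what working with UTF-8 in `char *` looks like in practice, assuming well-formed input. `strlen()` still works, but it counts bytes, so counting characters or truncating a string needs to respect continuation bytes (those of the form `10xxxxxx`). Both helper names below are hypothetical, not from any standard library.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: count code points in well-formed UTF-8
// by counting every byte that is not a continuation byte.
std::size_t Utf8CodePoints(const char* s) {
    std::size_t n = 0;
    for (; *s; ++s) {
        if ((static_cast<unsigned char>(*s) & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Hypothetical helper: longest prefix length, at most max_bytes,
// that does not cut a multi-byte sequence in half.
std::size_t Utf8SafePrefix(const char* s, std::size_t max_bytes) {
    std::size_t len = std::strlen(s);
    if (len <= max_bytes) return len;
    std::size_t cut = max_bytes;
    // Back up while the byte at the cut point is a continuation byte.
    while (cut > 0 && (static_cast<unsigned char>(s[cut]) & 0xC0) == 0x80) --cut;
    return cut;
}
```

For the five-byte string `"caf\xC3\xA9"`, `strlen()` reports 5, `Utf8CodePoints` reports 4, and `Utf8SafePrefix` with a limit of 4 backs up to 3 so the two-byte `é` is not split. Note that a code point is still not always what a user perceives as one character (combining marks, for example), which this sketch does not address.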
