
I'm currently working on a hobby project (C/C++) which is supposed to work on both Windows and Linux, with full support for Unicode. Sadly, Windows and Linux use different encodings, which makes our lives more difficult.

In my code I'm trying to keep the data as universal as possible, making it easy to handle both Windows and Linux. On Windows, wchar_t holds UTF-16 by default, and on Linux it holds UCS-4/UTF-32 (correct me if I'm wrong).
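
A quick way to verify this (the width of wchar_t is implementation-defined, but these are the usual defaults):

```cpp
#include <cstdio>

int main() {
    // Commonly prints 2 on Windows (UTF-16 code units) and
    // 4 on Linux (UCS-4/UTF-32 code points).
    std::printf("sizeof(wchar_t) = %zu\n", sizeof(wchar_t));
    return 0;
}
```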

My software opens files (_wfopen with UTF-16 paths on Windows, fopen with UTF-8 paths on Linux) and writes data to them in UTF-8. So far it's all doable. Until I decided to use SQLite.

SQLite's C/C++ interface accepts strings encoded as UTF-8 (one-byte units) or UTF-16 (two-byte units) (click). Of course this does not work with wchar_t on Linux, as wchar_t on Linux is 4 bytes by default. Therefore, reading from and writing to SQLite requires a conversion on Linux.
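
For illustration, a minimal sketch of binding a UTF-8 string (table and column names are made up; error handling omitted):

```cpp
#include <sqlite3.h>
#include <string>

// SQLite understands UTF-8 (sqlite3_bind_text) and UTF-16
// (sqlite3_bind_text16) -- but never 4-byte wchar_t, so on Linux a
// wchar_t string must be converted before it can be bound.
void insert_name(sqlite3* db, const std::string& utf8_name) {
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db, "INSERT INTO people(name) VALUES (?)",
                       -1, &stmt, nullptr);
    sqlite3_bind_text(stmt, 1, utf8_name.c_str(), -1, SQLITE_TRANSIENT);
    sqlite3_step(stmt);
    sqlite3_finalize(stmt);
}
```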

Currently the code is cluttered with special cases for Windows/Linux. I was hoping to stick to the standard idea of storing data in wchar_t:

  • wchar_t on Windows: file paths work without a problem, and so does reading from and writing to SQLite. Writing data to a file should be done in UTF-8 anyway.
  • wchar_t on Linux: a special case for file paths due to their UTF-8 encoding, a conversion before reading from and writing to SQLite (wchar_t), and, just as on Windows, a conversion when writing data to a file (see the sketch after this list).
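
For the Linux side, that wchar_t-to-UTF-8 conversion could look roughly like this hand-rolled sketch (it assumes wchar_t holds valid UCS-4 code points; invalid input is not handled):

```cpp
#include <string>

// Encode a 4-byte-wchar_t (UCS-4) string as UTF-8.
static std::string ucs4_to_utf8(const std::wstring& in) {
    std::string out;
    for (wchar_t wc : in) {
        unsigned cp = static_cast<unsigned>(wc);
        if (cp < 0x80) {                              // 1 byte: ASCII
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {                      // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {                    // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                                      // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```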

After reading (here) I was convinced I should stick to wchar_t on Windows. But after getting all of that to work, the trouble began when porting to Linux.

Currently I'm thinking of redoing it all to stick with simple char (UTF-8), because it works on both Windows and Linux, keeping in mind that I'll need to 'WideCharToMultiByte' every string on Windows to achieve UTF-8. Using simple char*-based strings will greatly reduce the number of Windows/Linux special cases.
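
On Windows that plan boils down to one small helper, something like this sketch (the function name is made up):

```cpp
#ifdef _WIN32
#include <windows.h>
#include <string>

// Convert a UTF-16 std::wstring coming from a Windows API into UTF-8,
// so the rest of the program only ever deals with char/UTF-8.
static std::string wide_to_utf8(const std::wstring& wide) {
    if (wide.empty()) return std::string();
    // First call computes the required buffer size, second call converts.
    int len = WideCharToMultiByte(CP_UTF8, 0, wide.data(),
                                  static_cast<int>(wide.size()),
                                  nullptr, 0, nullptr, nullptr);
    std::string utf8(len, '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(),
                        static_cast<int>(wide.size()),
                        &utf8[0], len, nullptr, nullptr);
    return utf8;
}
#endif
```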

Do you have any experience with Unicode for cross-platform development? Any thoughts about the idea of simply storing data in UTF-8 instead of using wchar_t?

ErikKou
  • A 2-byte character encoding is definitely *not* UTF-16. UTF-16 is 2 to 4 bytes, and UTF-8 is 1 to 4 bytes. Windows `wchar_t` is not UTF-16, it is UCS-2. In practice you may not notice the difference because UCS-2 covers the BMP, but if ever your users decide that they must have data in Ogham or runes... – user268396 Jun 28 '12 at 00:25
    Windows DOES use UTF-16, and DOES use `wchar_t` to hold UTF-16 data, and has done so since Windows 2000. – Remy Lebeau Jun 28 '12 at 00:45
    On how useful wchar_t is and for what: http://stackoverflow.com/a/11107667/365496 – bames53 Jun 28 '12 at 02:23
  • @RemyLebeau: I think that depends on the context. For example, you can set a password that isn't valid Unicode, and the console functions (such as WriteConsoleOutputCharacter) only seem to allow a single 16-bit word (presumably interpreted as UCS2) at each console coordinate. – Harry Johnston Jun 28 '12 at 03:59
    http://www.utf8everywhere.org pretty much answers this question, in the very site's URL :) – Pavel Radzivilovsky Jun 28 '12 at 14:53
  • To the person throwing that close vote, don't do that without informing a (new!) user as to why the vote was cast. Not that I agree with it anyway. Welcome ErikKou! – Maarten Bodewes Jun 28 '12 at 23:59
  • @owlstead: Thank you. :) I was not able to check my question earlier. From the comments I can tell the world of Unicode has not been standardized as much as we'd hope. – ErikKou Jun 29 '12 at 12:56

2 Answers


UTF-8 on all platforms, with just-in-time conversion to UTF-16 for Windows, is a common tactic for cross-platform Unicode.
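
A minimal sketch of that tactic, with illustrative helper names (the answer itself shows no code):

```cpp
#include <cstdio>
#include <string>

#ifdef _WIN32
#include <windows.h>

// UTF-8 -> UTF-16, done just in time for a Windows API call.
static std::wstring utf8_to_wide(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    int len = MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                                  static_cast<int>(utf8.size()), nullptr, 0);
    std::wstring wide(len, L'\0');
    MultiByteToWideChar(CP_UTF8, 0, utf8.data(),
                        static_cast<int>(utf8.size()), &wide[0], len);
    return wide;
}
#endif

// Everything above this boundary sees only UTF-8.
FILE* open_file(const std::string& utf8_path, const std::string& mode) {
#ifdef _WIN32
    return _wfopen(utf8_to_wide(utf8_path).c_str(),
                   utf8_to_wide(mode).c_str());
#else
    return std::fopen(utf8_path.c_str(), mode.c_str()); // paths are UTF-8 natively
#endif
}
```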

Puppy
  • I'd slightly adjust that statement and say: Native encoding on all platforms, with just-in-time conversion to/from UTF-8. That just-in-time conversion is required, whenever character strings leave the application (e.g. writing to a file, sending data over a network socket, passing input to a library, etc.). Of course, it all depends on the specific scenario. – IInspectable Aug 07 '16 at 11:15
  • Unicode, and more specifically UTF-8, is one of humanity's most elegant and impressive creations and social institutions. I feel so lucky to have started developing after UTF-8 settled in as a standard. – iono Dec 29 '20 at 11:01

Our software is cross-platform as well, and we faced similar problems. We decided that our goal is to have as few conversions as possible. This means that we use wchar_t on Windows and char on Unix/Mac.

We do this by supporting _T, LPCTSTR, and the like on Unix, and by having generic functions that easily convert between std::string and std::wstring. We also have a generic std::basic_string<TCHAR> (tstring) which we use in most cases.

So far this works quite well. Basically, most functions take a tstring or an LPCTSTR, and those that don't get their parameters converted from a tstring. That means that most of the time we don't convert our strings and just pass parameters through.
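
The answer doesn't show its actual definitions, but a hypothetical header emulating _T/TCHAR/tstring on Unix might look like this:

```cpp
#include <string>

#ifdef _WIN32
    #include <windows.h>                // LPCTSTR
    #include <tchar.h>                  // TCHAR, _T
#else
    typedef char TCHAR;                 // narrow chars on Unix/Mac
    typedef const TCHAR* LPCTSTR;
    #define _T(x) x                     // _T("abc") stays "abc"
#endif

// The generic string type used in most places: std::wstring on a
// Windows build with _UNICODE set, std::string (UTF-8) elsewhere.
typedef std::basic_string<TCHAR> tstring;

// Generic code can now be written once for both platforms:
// tstring path = _T("data.db");
```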

Fozi
    This is a possible solution too, but still a bit hacky. Also, from my reading I have learned that I should avoid using TCHAR as it was introduced to support backwards compatibility with older software by switching to MBCS instead of the Unicode flag. – ErikKou Jun 29 '12 at 13:04
  • @Fozi, How do I support _T on Ubuntu Linux? Thank you very much. – Frank Oct 22 '15 at 01:09
  • @ErikKou, What is your possible solution for emulating the Windows macro _T in Unix or Linux? Thank you. – Frank Oct 22 '15 at 01:11