
I want to write a language learning app for myself using Visual Studio 2017, C++ and the Windows API (formerly known as Win32). The operating system is the latest Windows 10 Insider build and backwards compatibility is a non-issue. Since I assume English to be the mother tongue of the user and the language I am currently interested in is another European language, ASCII might suffice. But I want to future-proof it (more exotic languages) and I also want to try my hand at UTF-32. I have previously used both UTF-8 and UTF-16, though I have more experience with the latter.

Thanks to std::basic_string, it was easy to figure out how to get a UTF-32 string:

typedef std::basic_string<char32_t> stringUTF32;

Since I am using the WinAPI for all GUI stuff, I need to do some conversion between UTF-32 and UTF-16.

Now to my problem: since UTF-32 is not widely used because of its inefficiencies, there is hardly any material about it on the web. To avoid unnecessary conversions, I want to save my vocabulary lists and other data as UTF-32 (for all UTF-8 advocates/evangelists, the alternative would be UTF-16). The problem is, I cannot find out how to write and open files in UTF-32.

So my question is: how do I write/open files in UTF-32? I would prefer that no third-party libraries be needed unless they are part of Windows or are usually shipped with that OS.

Willi
    "*But I want to future-proof it (more excotic languages) and I also want to try my hands on UTF-32.*" UTF-8 is no less "future-proof" than UTF-32. That is the whole point of UTF formats: they all encode the exact same ranges of data. – Nicol Bolas May 02 '18 at 17:13
  • With the comment from @NicolBolas, rather consider using [UTF-8 everywhere](http://utf8everywhere.org/). Then at least your text files will be readable by everyone else. – Some programmer dude May 02 '18 at 17:17
  • 2
Just to clarify, when you say "write and open" files, are you implicitly referring to iostream-style formatted IO, or just working with raw data? –  May 02 '18 at 17:21
  • Yes, but they have different strengths and weaknesses. With UTF-32 it is possible that each character takes up exactly one code point. So the length of the string equals the number of characters. This reduces problems when it comes to character manipulations. And yes, I am aware that you can compose French characters and German umlauts from multiple symbols instead of using one. é = é or é = ´ + e; But unless the system is configured to do this, the program should work as reliably as a good old ANSI program. And that is what I want to achieve. – Willi May 02 '18 at 17:22
  • *How to write/open files in UTF-32?* - what do you mean by this? How is encoding related to opening or writing a file at all? – RbMm May 02 '18 at 17:23
  • @Frank I mean I want to create new files, read existing ones and modify them if needed. Whichever method is best, I am content with. – Willi May 02 '18 at 17:25
  • 3
    @Willi The reason I ask is that at the end of the day, data is data. Reading and writing files has nothing to do with encoding unless you are trying to use the STL's formatted IO capabilities. –  May 02 '18 at 17:26
  • Are you saying that you want to be able to get externally created text files, in whatever encoding they may have, and then convert the contents and save them back as UTF-32? Then what will happen when the original program that created the file wants to read it and expects it to be in whatever encoding it used to write it? Don't touch files other programs created without thinking *very* carefully about the consequences. – Some programmer dude May 02 '18 at 17:27
  • 1
    @Some programmer dude I am against UTF-8 everywhere. It is as stupid as saying you should always use Java, C#, C++, etc. Windows uses UTF-16, so that is the best option unless there is a specific reason not to use it, unless, of course, you are a fan of wasting cycles on useless conversions. UTF-8 is the preferred choice for publishing on the Internet or if the resulting file is meant to be used separately from the program and may, therefore, be used on different platforms. – Willi May 02 '18 at 17:30
  • @Some programmer dude You can assume that existing files have the right encoding. Therefore, any file the program tries to open will be in UTF-32. The reason for this is that any files are either created by the program itself or are specifically created for it. – Willi May 02 '18 at 17:33
  • So, is there a standard library function or do I need to read the document as raw data and interpret it as UTF-32? – Willi May 02 '18 at 17:35
  • @Willi - encoding has nothing in common with files. What is the question about? – RbMm May 02 '18 at 17:35
  • @Willi, UTF-32 IS a raw data format, you just copy the bytes into the string, this is why your question is so confusing. –  May 02 '18 at 17:36
  • @Frank OK, thanks, I'll give it a try. – Willi May 02 '18 at 17:37
  • 2
    Also, just so you know, even in UTF-32, many text-based operations are STILL multi-symbol by nature if you want to do things properly, even something as simple as searching for a letter, so you are not actually gaining much relative to UTF-8 –  May 02 '18 at 17:38
  • 1
    So you don't want to have "useless conversion" between UTF-8 and UTF-16, but it's fine for conversions between your preferred UTF-32 and UTF-16? You *do* know that still needs conversions? And unless you only want to target a single CPU architecture (Intel x86 and derivatives) you also have to handle endianness issues causing even more possible "wasting cycles". And code-points are really irrelevant, what with combining characters and such. – Some programmer dude May 02 '18 at 17:39
  • @Willi just FYI, there is no need for your `stringUTF32` typedef, as C++11 and later have `std::u32string` available. But if you want to have `stringUTF32`, then at least use `std::u32string` in your `typedef` instead of `std::basic_string` directly, eg: `typedef std::u32string stringUTF32;` or `using stringUTF32 = std::u32string;` – Remy Lebeau May 02 '18 at 17:43
  • @Frank Well, if that is the case, I will probably have to use UTF-16. That way, I'll at least save the trouble of having to implement a conversion layer between the program logic and the WinAPI. And since all files are only meant to be used by the program internally, there is nothing gained by making them more accessible. Especially since there will be only a Windows version. – Willi May 02 '18 at 17:45
  • By the way, I'm not going to stop you from using UTF-32 when- and where-ever you want, that's not my intention. My intention is to make you think a little extra about the reasons you want to do something completely different from everybody else. – Some programmer dude May 02 '18 at 17:48
  • 1
    @Some programmer dude Well, there is a difference. UTF-8 has absolutely no advantage for my use case. This program will be a Windows exclusive, otherwise, I would not bother with the WinAPI. And on Windows, UTF-16 is the best-supported standard. Secondly, the files are only meant for internal use by the program. They are not output meant for the user to edit, read, publish, etc. With UTF-32, I hoped to reduce the problem that 1 code-point does not equal one character. Therefore, reducing the chance that the program will not work correctly, should I, for instance, choose to use Japanese symbols. – Willi May 02 '18 at 17:52
  • I do use UTF-8, but mostly with the web, for Linux or with programming languages that use it as a preferred encoding. Since the native encoding of Windows is UTF-16 and C++ is mostly encoding agnostic, I use UTF-16 to prevent an artificial barrier between my GUI and my logic. But even with the C++ and WinAPI combo, I use UTF-8 to create files meant to be used by the user independently of the program. – Willi May 02 '18 at 17:58
  • @Remy Lebeau thanks for the tip with std::u32string. – Willi May 02 '18 at 18:39
  • 2
    UTF-8 is good to use for disk storage, network comms, etc. UTF-16 strikes a good balance between memory usage and logic complexity, as MOST languages don't use glyphs outside the BMP (Eastern Asian languages, Emojis, etc). When processing UTF data, you usually have to process in UTF-32 anyway for codepoint comparisons, but you can do that on a per-codepoint basis without wasting memory. There is little benefit to storing UTF-32 on disk, or using UTF-32 strings in memory. Plus, UTF-32 doesn't solve the "1 codepoint != 1 grapheme" issue - graphemes are what users tend to think of as characters. – Remy Lebeau May 02 '18 at 19:21
  • @NicolBolas http://utf8everywhere.org/ – ventsyv May 02 '18 at 20:08

2 Answers


If you have a char32_t sequence, you can write it to a file using a std::basic_ofstream<char32_t> (which I will refer to as u32_ofstream, but this typedef does not exist). This works exactly like std::ofstream, except that it writes char32_ts instead of chars. But there are limitations.
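Note that `std::basic_ofstream<char32_t>` requires a matching `codecvt` facet in the stream's locale, which can be troublesome in practice. As a hedged alternative that sidesteps the stream's conversion machinery entirely, here is a minimal sketch (function and file names are my own illustration, not from the answer) that writes and reads the raw `char32_t` code units in binary mode:

```cpp
#include <fstream>
#include <string>

// Sketch: store a std::u32string on disk as raw UTF-32 code units in the
// platform's native byte order, preceded by a BOM. No third-party code.
void write_utf32(const std::string& path, const std::u32string& text)
{
    std::ofstream out(path, std::ios::binary);
    const char32_t bom = U'\U0000FEFF';
    out.write(reinterpret_cast<const char*>(&bom), sizeof bom);
    out.write(reinterpret_cast<const char*>(text.data()),
              text.size() * sizeof(char32_t));
}

std::u32string read_utf32(const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    char32_t bom = 0;
    // Read (and here simply skip) the BOM. A robust reader would compare
    // it against U+FEFF to detect a foreign byte order, as described below.
    in.read(reinterpret_cast<char*>(&bom), sizeof bom);
    std::u32string text;
    char32_t cu;
    while (in.read(reinterpret_cast<char*>(&cu), sizeof cu))
        text.push_back(cu);
    return text;
}
```

This assumes the file is read back on a machine with the same endianness it was written on; the BOM check discussed next handles the mismatched case.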

Most standard library types that have an operator<< overload are templated on the character type. So they will work with u32_ofstream just fine. The problem you will encounter is for user types. These almost always assume that you're writing char, and thus are defined as ostream &operator<<(ostream &os, ...);. Such stream output can't work with u32_ofstream without a conversion layer.

But the big issue you're going to face is endian issues. u32_ofstream will write char32_t as your platform's native endian. If your application reads them back through a u32_ifstream, that's fine. But if other applications read them, or if your application needs to read something written in UTF-32 by someone else, that becomes a problem.

The typical solution is to use a "byte order mark" as the first character of the file. Unicode even has a specific codepoint set aside for this: \U0000FEFF.

The way a BOM works is like this. When writing a file, you write the BOM before any other codepoints.

When reading a file of an unknown encoding, you read the first codepoint as normal. If it comes out equal to the BOM in your native encoding, then you can read the rest of the file as normal. If it doesn't, then you need to read the file and endian-convert it before you can process it. That process would look a bit like this:

constexpr char32_t native_bom = U'\U0000FEFF';

u32_ifstream is(...);
char32_t bom;
is >> bom;
if(native_bom == bom)
{
  process_stream(is);
}
else
{
  basic_stringstream<char32_t> char_stream;
  //Load the rest of `is` and endian-convert it into `char_stream`.
  process_stream(char_stream);
}
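The endian-conversion step left as a comment above can be sketched as a per-code-unit byte swap (the helper name is my own, and real code would also validate the resulting code points):

```cpp
#include <cstdint>

// Sketch: reverse the four bytes of one char32_t code unit, turning a
// foreign-endian UTF-32 code unit into the native byte order.
inline char32_t swap_endian(char32_t cu)
{
    std::uint32_t u = static_cast<std::uint32_t>(cu);
    return static_cast<char32_t>(((u & 0x000000FFu) << 24) |
                                 ((u & 0x0000FF00u) << 8)  |
                                 ((u & 0x00FF0000u) >> 8)  |
                                 ((u & 0xFF000000u) >> 24));
}
```

Applying this to every code unit read from the stream (the BOM included) yields the native-endian sequence to feed into `process_stream`.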
Nicol Bolas
  • I have implemented the read function without this check as it is not strictly necessary for my use case (that was before you posted this answer). But I do think that I should include it as good practice and to avoid trouble. Though with a slight alteration: if the encoding does not match the native encoding, skip that file with an error message. When I have time, I shall read up on how to do an endian conversion. – Willi May 02 '18 at 19:19
  • `std::basic_ifstream input_stream(filename, std::ios::in);` results in an error: Severity Code Description Project File Line Suppression State Error LNK2001 unresolved external symbol "__declspec(dllimport) public: static class std::locale::id std::codecvt::id" – Willi May 02 '18 at 20:16

I am currently interested in is another European language, [so] ASCII might suffice

No. Even in plain English. You know how Microsoft Word creates “curly quotes”? Those are non-ASCII characters. All those letters with accents and umlauts in e.g. French or German are non-ASCII characters.

I want to future-proof it

UTF-8, UTF-16 and UTF-32 all can encode every Unicode code point. They’re all future-proof. UTF-32 does not have an advantage over the other two.
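The point can be seen directly with C++ literals: the same code point (here U+1F600, an emoji outside the BMP) is representable in all three encodings, differing only in how many code units it occupies. A small sketch (the UTF-8 bytes are spelled out explicitly to avoid the C++20 `char8_t` literal change):

```cpp
#include <string>

// The same code point, U+1F600, in each UTF encoding form.
const std::string    utf8  = "\xF0\x9F\x98\x80"; // 4 code units (bytes)
const std::u16string utf16 = u"\U0001F600";      // 2 code units (surrogate pair)
const std::u32string utf32 = U"\U0001F600";      // 1 code unit
```

All three spellings decode to the identical code point; no encoding can represent anything the others cannot.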

Also for future proofing: I’m quite sure some scripts use characters (the technical term is ‘grapheme clusters’) consisting of more than one code point. A cursory search turns up Playing around with Devanagari characters.

A downside of UTF-32 is support in other tools. Notepad won’t open your files. Beyond Compare won’t. Visual Studio Code… nope. Visual Studio will, but it won’t let you create such files.

And the Win32 API: it has a function MultiByteToWideChar which can convert UTF-8 to UTF-16 (which you need to pass in to all Win32 calls) but it doesn’t accept UTF-32.
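That said, if you did store UTF-32, converting it to the UTF-16 that Win32 expects does not require MultiByteToWideChar at all: it is plain surrogate-pair arithmetic from the Unicode standard. A hedged sketch (function name is illustrative, and it omits validation of lone surrogates or out-of-range input):

```cpp
#include <string>

// Sketch: convert UTF-32 to UTF-16. Code points at or below U+FFFF map to
// one code unit; supplementary code points map to a surrogate pair.
std::u16string utf32_to_utf16(const std::u32string& in)
{
    std::u16string out;
    for (char32_t cp : in) {
        if (cp <= 0xFFFF) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

The reverse direction is equally mechanical, which is part of why dedicated OS support for UTF-32 never materialized.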

So my honest answer to this question is, don’t. Otherwise follow Nicol’s answer.

roeland
  • Thank you for your answer and sorry for the late reply, I was busy. I have chosen to use UTF-16 the day I posted my question. I would have given UTF-32 a try, if for no other reason than curiosity, but the lack of documentation is simply too much of a problem. Back to your answer: I probably should not have written ANSI but ANSI + Microsoft extensions, or the code pages Windows uses while in ANSI mode. Therefore, Windows does support enough of the Latin alphabet if you only need a bilingual system: English + another language that uses the Latin alphabet. Anyway, that support is good enough – Willi May 09 '18 at 23:55
  • for quite a few of my applications. Nevertheless, I do mostly use UNICODE mode. There are programs where I care a lot that every char == a character and I could not care less if someone can type their French, German, Russian or Japanese name. That is the reason I still occasionally use the old mode. – Willi May 09 '18 at 23:56