I need to read text from a file that could contain any type of character (char, char8_t, wchar_t, etc.). How can I determine which character type is used and create an instance of basic_ifstream<char_type> for that type?

user7769147

2 Answers

It's impossible to know for sure. You have to be told what the character type is. Frequently text files will begin with a Byte-Order-Mark to clue you in, but even that's not entirely foolproof.

You can make reasonable guesses as to the file contents, for example, if you "know" that most of it is ASCII-range text, it should be easy to figure out if the file is full of char or wchar_t characters. Even this relies on assumptions and should not be considered bulletproof.
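One way to make that guess concrete is to look at the byte pattern: ASCII-range text stored as 16-bit code units has a NUL in every other byte, while 8-bit text should contain no NULs at all. Here is a minimal sketch of that heuristic; the function name and the one-third threshold are my own choices, not anything standard:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: guess whether ASCII-range text is stored as
// 16-bit code units (e.g. UTF-16) by counting NUL bytes. ASCII text
// encoded as UTF-16 has a NUL in every other byte; 8-bit text should
// have none.
bool looks_like_16bit_text(const std::string& bytes)
{
    if (bytes.empty())
        return false;
    std::size_t nuls = 0;
    for (unsigned char c : bytes)
        if (c == 0)
            ++nuls;
    // Roughly half the bytes being NUL strongly suggests 16-bit units;
    // require more than a third to leave some margin.
    return nuls * 3 > bytes.size();
}
```

Note the `std::string(ptr, len)` constructor is needed when building test data with embedded NULs, since a plain string literal would stop at the first `\0`.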

Tumbleweed53
So I guess you want to auto-detect the encoding of an unknown text file.

This is impossible to do in a 100% reliable way. However, in my experience you can achieve very high reliability (> 99.99%) in most practical situations. The bigger the file, the more reliable the guess: a few tens of bytes is usually already enough to be confident.

A valid Unicode code point is a value from U+0001 to U+10FFFF inclusive, excluding the surrogate range U+D800 to U+DFFF. Code point U+0000 is technically valid, but excluding it greatly reduces the number of false positives (NUL bytes should never appear in any practical text file). For an even better guess, you can also exclude some very rare control characters.
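The validity test described above fits in a few lines. This is only a sketch of that rule (the function name is mine), without the optional control-character exclusions:

```cpp
#include <cstdint>

// A code point is "plausible text" if it lies in [U+0001, U+10FFFF]
// and is not a surrogate (U+D800..U+DFFF). U+0000 is rejected on
// purpose to cut down false positives, as explained above.
bool is_plausible_code_point(std::uint32_t cp)
{
    if (cp == 0 || cp > 0x10FFFF)
        return false;
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;
    return true;
}
```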

Here is the algorithm I would propose:

  • If the file begins with a valid BOM (UTF-8, UTF-16BE/LE, UTF-32BE/LE), trust that BOM.
  • If the file contains only ASCII characters (non-null bytes < 128), treat it as ASCII (use char).
  • If the file is valid UTF-8, assume UTF-8 (use char8_t, though char also works). Note that ASCII is a subset of UTF-8, so this check makes the previous one redundant.
  • If the file is valid UTF-32 (check both little- and big-endian variants), assume UTF-32 (char32_t, possibly also wchar_t on Linux or macOS). Swap the bytes if needed.
  • If the file is valid UTF-16 (check both little- and big-endian variants, including the restrictions on surrogate pairs), and the even bytes or the odd bytes correlate more strongly among themselves than all bytes do together, assume UTF-16 (char16_t, possibly also wchar_t on Windows). Swap the bytes if needed.
  • Otherwise the file is probably not in a Unicode encoding and may use an old code page. Good luck auto-detecting which one; by far the most common is ISO 8859-1 (Latin-1), so use char. It may also be raw binary data.
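The first step above, BOM sniffing, can be sketched as follows. The enum and function name are my own; the BOM byte sequences themselves are the standard ones. Note that UTF-32 must be tested before UTF-16, because the UTF-32LE BOM (FF FE 00 00) begins with the UTF-16LE BOM (FF FE):

```cpp
#include <cstddef>
#include <initializer_list>
#include <string>

enum class Encoding { Unknown, Utf8, Utf16BE, Utf16LE, Utf32BE, Utf32LE };

// Check the leading bytes of the file for a byte-order mark.
// Order matters: the UTF-32LE BOM starts with the UTF-16LE BOM.
Encoding detect_bom(const std::string& b)
{
    auto starts_with = [&](std::initializer_list<unsigned char> prefix) {
        if (b.size() < prefix.size())
            return false;
        std::size_t i = 0;
        for (unsigned char c : prefix)
            if (static_cast<unsigned char>(b[i++]) != c)
                return false;
        return true;
    };
    if (starts_with({0x00, 0x00, 0xFE, 0xFF})) return Encoding::Utf32BE;
    if (starts_with({0xFF, 0xFE, 0x00, 0x00})) return Encoding::Utf32LE;
    if (starts_with({0xEF, 0xBB, 0xBF}))       return Encoding::Utf8;
    if (starts_with({0xFE, 0xFF}))             return Encoding::Utf16BE;
    if (starts_with({0xFF, 0xFE}))             return Encoding::Utf16LE;
    return Encoding::Unknown;
}
```

When building test data with embedded NUL bytes, use the `std::string(ptr, len)` constructor; a plain literal would be truncated at the first `\0`.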
prapin