
I've spent many hours now reading about Unicode, its encodings, and many related topics.
The reason for my research is that I am trying to read the contents of a file and parse them character by character.

Correct me if I am wrong please:

  • C++'s getc() returns an int, which might equal EOF.
    If the return value does not equal EOF, it can safely be assigned to a char.
    Since std::string is based on char, we can build std::strings with these chars and use those (see the sketch just below).
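
A minimal sketch of that pattern, reading a file byte by byte with getc() and collecting the bytes into a std::string (the file name is just a placeholder):

#include <cstdio>
#include <string>

int main()
{
    // Open in binary mode so no newline translation happens.
    std::FILE* file = std::fopen( "input.txt", "rb" );
    if( !file )
        return 1;

    std::string contents;
    // getc() returns an int; every non-EOF value fits into a single char.
    for( int c = std::getc( file ); c != EOF; c = std::getc( file ) )
        contents.push_back( static_cast<char>( c ) );

    std::fclose( file );
    return 0;
}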

I have a C# background, where we use C#'s char (16-bit) for strings.
The values of these chars map directly to Unicode values.
A char whose value is 5 is equal to the Unicode character located at U+0005.

What I don't understand is how to read a file in C++ that contains characters whose values might be larger than a byte. I don't feel comfortable using getc() when I can only read characters whose values are limited to a byte.

I might be missing an important point on how to correctly read files with C++.
Any insights are very much appreciated.

I am running Windows 10 x64 using VC++.
But I'd prefer to keep this question platform-independent if that is possible.


EDIT

I'd like to emphasize a Stack Overflow post linked in the comments by Klitos Kyriacou:
How well is Unicode supported in C++11?

It's a quick dive into how poorly Unicode is supported in C++.
For more details you should read/watch the resources provided in the accepted answer.

Noel Widmer
  • Have you had a chance to look at `std::wstring` and/or `wchar_t`? – Vada Poché Feb 22 '17 at 23:19
  • What encoding are you looking to use? – Klitos Kyriacou Feb 22 '17 at 23:22
  • @VadaPoché Let me read that stuff up ... – Noel Widmer Feb 22 '17 at 23:24
  • @KlitosKyriacou UTF8 – Noel Widmer Feb 22 '17 at 23:24
  • Whereas C# (and Java for that matter) does your encoding/decoding automatically during the read/write operations, in C++ you have to read your bytes as bytes and then use [std::codecvt](http://en.cppreference.com/w/cpp/locale/codecvt). See also the question [How well is Unicode supported in C++11?](http://stackoverflow.com/questions/17103925/how-well-is-unicode-supported-in-c11). – Klitos Kyriacou Feb 22 '17 at 23:36
  • The Unicode support in C++ is not great. Depending on what you want to do with those strings in the end, consider using a library like ICU. – Baum mit Augen Feb 22 '17 at 23:37
  • @KlitosKyriacou Thanks for the link! Sounds promising ... not. I'll have to read it a couple more times :D – Noel Widmer Feb 22 '17 at 23:48
  • _The values of these chars map directly to Unicode values._ - You are aware that a 16-bit data type can't hold the full Unicode range? – zett42 Feb 23 '17 at 00:58
  • @zett42 Yes, I am. But as far as I know .NET (Windows in general) is built around UTF-16. 16-bit characters are a core part of the framework. Not sure how we would read 24- or 32-bit Unicode characters using .NET. But honestly, I never had issues with the .NET character encodings (yet). – Noel Widmer Feb 23 '17 at 08:25
  • I have updated my answer to read an UTF-8 file and convert to UTF-16 string. – zett42 Feb 23 '17 at 21:31

3 Answers


The equivalent of a 16-bit "character" that is compatible with the Windows API is wchar_t. Be aware, though, that wchar_t might be 32-bit on some platforms, so use char16_t if you want to store a UTF-16-encoded string in a platform-independent way.

If you use char16_t on the Windows platform, though, you have to do some casts when passing strings to the OS API.
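
A minimal sketch of such a cast, using a std::u16string with the wide MessageBoxW API; it relies on wchar_t and char16_t both being 16 bits wide, which is the case on Windows:

#include <windows.h>
#include <string>

int main()
{
    std::u16string text = u"Hello from char16_t";

    // On Windows, wchar_t and char16_t have the same size and representation,
    // so a reinterpret_cast bridges the two when calling the wide API.
    ::MessageBoxW( nullptr,
                   reinterpret_cast<const wchar_t*>( text.c_str() ),
                   L"test", MB_OK );
    return 0;
}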

The equivalent string types are:

  • std::wstring (wchar_t)
  • std::u16string (char16_t)

File stream types:

  • std::wifstream (a typedef for std::basic_ifstream<wchar_t>)
  • std::basic_ifstream<char16_t>
  • std::wofstream (a typedef for std::basic_ofstream<wchar_t>)
  • std::basic_ofstream<char16_t>

Example that reads a UTF-8-encoded file into a UTF-16 string:

#include <windows.h>
#include <fstream>
#include <string>
#include <locale>
#include <codecvt>

int main()
{   
    std::wifstream file( L"test_utf8.txt" );

    // Apply a locale to read UTF-8 file, skip the BOM if present and convert to UTF-16.
    file.imbue( std::locale( file.getloc(),
        new std::codecvt_utf8_utf16<wchar_t, 0x10ffff, std::consume_header> ) );

    std::wstring str;
    std::getline( file, str );

    // Call the wide API explicitly so the wchar_t string compiles regardless of the UNICODE macro.
    ::MessageBoxW( 0, str.data(), L"test", 0 );

    return 0;
}

How to read an UTF-16 encoded file into a 16-bit std::wstring or std::u16string?

Apparently this isn't so easy. There is std::codecvt_utf16, but when used with a 16-bit wchar_t character type it produces UCS-2, which is only a subset of UTF-16, so surrogate pairs won't be read correctly. See the cppreference example.

I don't know how the C++ ISO committee came to this decision, because it's completely useless in practice. At least they should have provided a flag so we could choose whether we want to restrict ourselves to UCS-2 or read the full UTF-16 range.

Maybe there is another solution, but right now I'm not aware of it.
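
One pragmatic workaround, echoed in the comments below, is to read the file as raw bytes and assemble the UTF-16 code units yourself, handling the BOM and byte order manually. A minimal sketch, assuming a well-formed UTF-16 file (surrogate pairs simply pass through as two code units):

#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

std::u16string readUtf16File( const char* path )
{
    std::ifstream file( path, std::ios::binary );
    std::vector<unsigned char> bytes( ( std::istreambuf_iterator<char>( file ) ),
                                      std::istreambuf_iterator<char>() );

    // Detect and skip an optional BOM; default to little-endian if none is present.
    bool bigEndian = false;
    std::size_t start = 0;
    if( bytes.size() >= 2 )
    {
        if( bytes[0] == 0xFF && bytes[1] == 0xFE )      { start = 2; }
        else if( bytes[0] == 0xFE && bytes[1] == 0xFF ) { start = 2; bigEndian = true; }
    }

    // Combine pairs of bytes into 16-bit code units.
    std::u16string result;
    for( std::size_t i = start; i + 1 < bytes.size(); i += 2 )
    {
        const unsigned lo = bigEndian ? bytes[i + 1] : bytes[i];
        const unsigned hi = bigEndian ? bytes[i]     : bytes[i + 1];
        result.push_back( static_cast<char16_t>( lo | ( hi << 8 ) ) );
    }
    return result;
}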

zett42
  • +1 for multiple reasons: Pointing out wstring and wchar_t which I NOW believe shouldn't be used for UTF8, describing different file streams and providing a sample for both UTF8 and UTF16. I did not accept this as the answer because I believe a library such as the utfcpp (mentioned by Trevor) can handle validation and decoding much better than any sample I can understand/write. Still, if someone is looking for a self written solution, this is probably the way to go. Thanks for the info zett42 :) – Noel Widmer Feb 23 '17 at 22:33
  • Unfortunately I had to remove the UTF-16 example because it only read UCS-2 (see above). Should read documentation more carefully... – zett42 Feb 24 '17 at 22:45
  • Just read the file as binary. Use the standard library's machinery where it works effortlessly, do something else where it's yuck-ish. At times in the past my "do something else" has included writing UTF-8 codecvt-ers from scratch, but now with C++11 and later the library is not sufficiently yuck-ish in that respect, to justify the effort. – Cheers and hth. - Alf Feb 24 '17 at 22:53
  • It gives UCS-2 because that's all it can do faced with an implementation where wchar_t is 16 bit. C++ standard requires wchar_t to be big enough for every code point, meaning 32 bits (as of 1996). – Cubbi Mar 13 '17 at 13:44

The situation is that C's getc() was written in the 1970s. To all intents and purposes, it means "read an octet", not "read a character". Virtually all binary data is built on octets.

Unicode allows characters beyond the range an octet can represent. So, naively, the Unicode people proposed a standard for 16-bit characters. Microsoft then incorporated the proposal early on and added wide characters (wchar_t and so on) to Windows. One problem was that 16 bits are not enough to represent every glyph in every human language with any official status; another was the endianness of the binary files. So the Unicode people had to add a 32-bit Unicode standard, and they then had to incorporate a little endianness and format tag (the byte order mark) at the start of Unicode files. Finally, the 16-bit Unicode glyphs didn't quite match Microsoft's wchar_t glyphs.

So the result was a mess. It is quite difficult to read and display 16- or 32-bit Unicode files with complete accuracy and portability. Also, very many programs were still using 8-bit ASCII.

Fortunately, UTF-8 was invented. UTF-8 is backwards compatible with 7-bit ASCII. If the top bit is set, then the glyph is encoded in more than one byte, and there's a scheme that tells you how many. The NUL byte never appears except as an end-of-string indicator. So most programs will process UTF-8 correctly, unless they try to split strings or otherwise try to treat them as English.
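
As a small illustration of that scheme, the number of bytes in a UTF-8 sequence can be read straight off the lead byte (this just encodes the standard UTF-8 bit patterns):

#include <cstddef>

// Length in bytes of a UTF-8 sequence, judged by its lead byte.
// Returns 0 for a continuation byte or an invalid lead byte.
std::size_t utf8SequenceLength( unsigned char lead )
{
    if( lead < 0x80 )             return 1; // 0xxxxxxx: plain 7-bit ASCII
    if( ( lead & 0xE0 ) == 0xC0 ) return 2; // 110xxxxx
    if( ( lead & 0xF0 ) == 0xE0 ) return 3; // 1110xxxx
    if( ( lead & 0xF8 ) == 0xF0 ) return 4; // 11110xxx
    return 0;                               // 10xxxxxx continuation byte or invalid
}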

UTF-8 has the penalty that random access to chars isn't possible, because of the variable length rule. But that's a minor disadvantage. Generally UTF-8 is the way to go for saving Unicode text and passing it about in programs, and you should only break it out into Unicode code points when you actually need the glyphs, e.g. for display purposes.
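
And a minimal sketch of "breaking it out into code points" when the glyphs are actually needed; it assumes well-formed UTF-8 and skips validation for brevity:

#include <cstddef>
#include <string>
#include <vector>

std::vector<char32_t> decodeUtf8( const std::string& s )
{
    std::vector<char32_t> codePoints;
    for( std::size_t i = 0; i < s.size(); )
    {
        const unsigned char lead = static_cast<unsigned char>( s[i] );

        // Determine the sequence length and the payload bits of the lead byte.
        std::size_t len = 1;
        char32_t cp = lead;
        if( lead >= 0xF0 )      { len = 4; cp = lead & 0x07; }
        else if( lead >= 0xE0 ) { len = 3; cp = lead & 0x0F; }
        else if( lead >= 0xC0 ) { len = 2; cp = lead & 0x1F; }

        // Append 6 payload bits from each continuation byte.
        for( std::size_t k = 1; k < len && i + k < s.size(); ++k )
            cp = ( cp << 6 ) | ( static_cast<unsigned char>( s[i + k] ) & 0x3F );

        codePoints.push_back( cp );
        i += len;
    }
    return codePoints;
}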

Malcolm McLean
  • +1 for providing history. It should also be said that UTF-16 is very error-prone, because even if developers are not aware of surrogate pairs, it will work 99% of the time because these devs will most likely only test with code points in the UCS-2 range. – zett42 Feb 24 '17 at 22:49
  • *UTF-8 has the penalty that random access to chars isn't possible* ... that's also true for UTF-16 and even UTF-32, because an [abstract character](https://en.wikipedia.org/wiki/Unicode#Abstract_characters) can be composed of multiple Unicode characters. – zett42 Feb 24 '17 at 22:55

I'd recommend watching Unicode in C++ by James McNellis.
That will help explain what facilities C++ has and does not have when dealing with Unicode.
You will see that C++ lacks good support for easily working with UTF-8.

Since it sounds like you want to iterate over each glyph (not just code points),
I'd recommend using a 3rd-party library to handle the intricacies.
utfcpp has worked well for me.
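
A minimal sketch of what that can look like, assuming the header-only utf8.h from the utfcpp project; note that utfcpp works at the code-point level, and the file name is just a placeholder:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <iterator>
#include <string>
#include "utf8.h" // header-only utfcpp library

int main()
{
    // Read the raw bytes of the file.
    std::ifstream file( "test_utf8.txt", std::ios::binary );
    std::string bytes( ( std::istreambuf_iterator<char>( file ) ),
                       std::istreambuf_iterator<char>() );

    // 1) Validation: find_invalid() returns end() if the data is well-formed UTF-8.
    if( utf8::find_invalid( bytes.begin(), bytes.end() ) != bytes.end() )
    {
        std::cerr << "File contains invalid UTF-8\n";
        return 1;
    }

    // 2) Iterate code point by code point.
    auto it = bytes.begin();
    while( it != bytes.end() )
    {
        std::uint32_t cp = utf8::next( it, bytes.end() );
        std::cout << "U+" << std::hex << cp << '\n';
    }
    return 0;
}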

Trevor Hickey
  • The talk you linked provides some important insights into Unicode support in C++. I can recommend it to anyone who wants to better understand character encodings in general (not only C++). I will go with utfcpp because from what I have figured out it appears to provide the best functionality for 1) validation and 2) conversion. – Noel Widmer Feb 23 '17 at 22:37