How to read a UTF-16 text file in C++17

Question

I am very new to C++. I want to read a UTF-16 text file in C++17 in Visual Studio 2019.

I have tried several methods from the internet (including StackOverflow), but none of them worked, and some of them didn't compile (I think they only support older compilers).

I am trying to achieve this without using any 3rd party libraries.

The following reads the text file, but the output has some weird characters and a space between each letter.

// open file for reading
std::wifstream istrm(filename, std::ios::binary);
if (!istrm.is_open()) {
    std::cout << "failed to open " << filename << '\n';
}
else {
    std::wstring s;
    std::getline(istrm, s);
    std::wcout << s << std::endl;
}

Then I found some solutions for this that use the following headers:

#include <locale>
#include <codecvt>

// open file for reading
std::wifstream istrm(filename, std::ios::binary);
istrm.imbue(std::locale(istrm.getloc(), new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
if (!istrm.is_open()) {
    std::cout << "failed to open " << filename << '\n';
}
else {
    std::wstring s;
    std::getline(istrm, s);
    std::wcout << s << std::endl;
}

This time it didn't even compile. I got the following error at the std::codecvt_utf16 line:

Error C4996 'std::codecvt_utf16': warning STL4017: std::wbuffer_convert, std::wstring_convert, and the <codecvt> header (containing std::codecvt_mode, std::codecvt_utf8, std::codecvt_utf16, and std::codecvt_utf8_utf16) are deprecated in C++17. (The std::codecvt class template is NOT deprecated.) The C++ Standard doesn't provide equivalent non-deprecated functionality; consider using MultiByteToWideChar() and WideCharToMultiByte() from <Windows.h> instead. You can define _SILENCE_CXX17_CODECVT_HEADER_DEPRECATION_WARNING or _SILENCE_ALL_CXX17_DEPRECATION_WARNINGS to acknowledge that you have received this warning.

I would appreciate it if someone could provide a solution for this.

Thanks in advance.

So there is no way to do this without using any 3rd party library? — , Jun 23 '19 at 11:36
The Standard Library isn't really mature enough yet to properly handle Unicode. Furthermore, the entire `<codecvt>` header was deprecated in C++17. — DeiDei, Jun 23 '19 at 11:38
You could write the code yourself. Decoding UTF-16 isn't all that difficult. Just steal some code off of github ;) — DeiDei, Jun 23 '19 at 11:42
@DeiDei thanks for your input, I will try to find some good code on GitHub :) — , Jun 23 '19 at 11:45
Just out of curiosity: if I read the UTF-16 text file into a normal `char*` (`std::string`), which has the weird characters and spaces, and then convert this `std::string` to `char16_t*`, will I get the correct output? — , Jun 23 '19 at 11:48
The entire library was not deprecated. The standard library is completely sufficient to do what you want. Someone will post an answer in due time, don't give up! :) — cyberbisson, Jun 23 '19 at 11:49
Please post the compilation error for those of us that don't have Visual Studio. — cyberbisson, Jun 23 '19 at 11:51
probably a duplicate of: https://stackoverflow.com/questions/17103925/how-well-is-unicode-supported-in-c11 — P. PICARD, Jun 23 '19 at 11:51
@cyberbisson I've added the error as well. It says to use `MultiByteToWideChar()`, which I tried by following a tutorial, but I couldn't get it to work. (I am very new to C++.) — , Jun 23 '19 at 11:58
It is not an error. It is going to show this *warning* for at least another decade, such are the benefits of using a C++ compiler with a 1-800 support phone number. So simply #define _SILENCE_CXX17_CODECVT_HEADER_DEPRECATION_WARNING to move on with your life. — Hans Passant, Jun 23 '19 at 12:07
On Windows, you don't need to do any encoding conversion, nor any external library for this task. Create a `std::ifstream` in binary mode, then simply read the whole file into a `std::wstring` and be done with it, assuming you want to read `UTF-16 LE` only. For `UTF-16 BE` (rarely used on Windows) you need to swap the two bytes of every 16-bit code unit. — zett42, Jun 23 '19 at 12:41
I see, my text file seems to be `UTF-16 BE`, which may be why there is an extra space between every letter and the first letter is a square. — , Jun 23 '19 at 13:39
The 1st "letter" you are seeing probably is the [BOM](https://en.wikipedia.org/wiki/Byte_order_mark). You can use it to distinguish between UTF-16 LE/BE. — zett42, Jun 23 '19 at 14:18
Microsoft's [fopen](https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/fopen-wfopen?view=vs-2019) can be used to read UTF-encoded files as well; e.g. `fopen("newfile.txt", "rt+, ccs=UTF-16LE")`. — Mark Tolonen, Jun 23 '19 at 19:01
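
As a rough illustration of what zett42 describes in the comments above (binary read, BOM check, byte swap for big-endian), here is a minimal sketch that uses only the standard library. The names `decode_utf16_bytes` and `read_utf16_file` are made up for this example, and treating a file without a BOM as little-endian is an assumption, not something the comments state.

```cpp
#include <cstddef>
#include <fstream>
#include <iterator>
#include <stdexcept>
#include <string>

// Hypothetical helper: turn raw UTF-16 bytes into 16-bit code units.
// A leading BOM (FE FF = big-endian, FF FE = little-endian) selects the
// byte order and is dropped. Without a BOM, little-endian is ASSUMED.
std::u16string decode_utf16_bytes(const std::string& bytes) {
    if (bytes.size() % 2 != 0)
        throw std::runtime_error("odd byte count: not valid UTF-16");

    auto byte_at = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };

    bool big_endian = false;
    std::size_t pos = 0;
    if (bytes.size() >= 2) {
        if (byte_at(0) == 0xFE && byte_at(1) == 0xFF) {
            big_endian = true;
            pos = 2;
        } else if (byte_at(0) == 0xFF && byte_at(1) == 0xFE) {
            pos = 2;
        }
    }

    std::u16string units;
    units.reserve((bytes.size() - pos) / 2);
    for (; pos < bytes.size(); pos += 2) {
        const unsigned first = byte_at(pos);
        const unsigned second = byte_at(pos + 1);
        // Big-endian stores the high byte first, little-endian the low byte.
        const unsigned unit = big_endian ? (first << 8) | second
                                         : (second << 8) | first;
        units.push_back(static_cast<char16_t>(unit));
    }
    return units;
}

// Hypothetical helper: read the whole file as bytes, then decode it.
std::u16string read_utf16_file(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in)
        throw std::runtime_error("failed to open " + path);
    const std::string bytes((std::istreambuf_iterator<char>(in)),
                            std::istreambuf_iterator<char>());
    return decode_utf16_bytes(bytes);
}
```

On Windows, where `wchar_t` is 16 bits wide, the result can be copied into a wide string with `std::wstring(units.begin(), units.end())`. Surrogate pairs are left as they are, because this sketch only fixes the byte order.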

score 1 · Answer 1 · answered Jun 23 '19 at 12:33

First of all, read related questions like Does std::wstring support UTF-16 and UTF-32 on Windows? and Is 16-bit wchar_t formally valid for representing full Unicode?.

If what you want is simply to read/write strings as a blob for which you already know the encoding is UTF-16, without performing any conversion or manipulation, and you are in an environment like Visual Studio 2019 on Windows for which wchar_t is intended to hold UTF-16, then you can use the C++ wide strings and streams.

Now, if you need to perform conversions, support several encodings, iterate within strings (for some definitions of iterate), or in general anything non-trivial, you are out of luck at the moment if you want to stay within C++17. The C++ Standard committee has established a working group for Unicode, so expect to see some improvements in this area in the upcoming years. For the moment, you will need to use either Win32 functions like MultiByteToWideChar and WideCharToMultiByte, or an external library like International Components for Unicode (ICU) or Boost's Locale.
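
To give a feel for what such a conversion involves, here is a hand-written UTF-16 to UTF-8 converter. This is not the Win32 or ICU route recommended above. It is a portable sketch along the lines of the "write the code yourself" suggestion in the question's comments. The name `utf16_to_utf8` is hypothetical, and replacing unpaired surrogates with U+FFFD is a policy chosen for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: convert UTF-16 code units to a UTF-8 byte string.
// Unpaired surrogates become U+FFFD (a choice made for this sketch).
std::string utf16_to_utf8(const std::u16string& in) {
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];

        if (cp >= 0xD800 && cp <= 0xDBFF) {
            // High surrogate: it must be followed by a low surrogate.
            if (i + 1 < in.size() && in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
                ++i;  // the low surrogate has been consumed
            } else {
                cp = 0xFFFD;
            }
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            cp = 0xFFFD;  // low surrogate without a preceding high one
        }

        // Emit 1 to 4 bytes depending on the size of the code point.
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

For anything beyond this kind of straight transcoding (normalization, collation, grapheme iteration), the libraries named above remain the safer choice.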

Hello, thank you for your answer. I'll try the `MultiByteToWideChar` and `WideCharToMultiByte` approach again and see. I tried it earlier, but I couldn't get it to work. — , Jun 23 '19 at 13:41
@LukeWilliam You're welcome! They are not hard to use -- if you have trouble, please open another question with the code that fails and we can take a look :-) Typically, you will want to call `MultiByteToWideChar`/`WideCharToMultiByte` functions twice: first you find out the final length after conversion, then you allocate memory for it, and then you call it again to write the actual result into the buffer. — Acorn, Jun 23 '19 at 13:56
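
The two-call pattern described in the comment above might look roughly like this for `WideCharToMultiByte`. This is a Windows-only sketch, which is why the whole block is guarded by `_WIN32`. The helper name `wide_to_utf8` is hypothetical, and it assumes the input is a UTF-16 `std::wstring` that should become UTF-8.

```cpp
#ifdef _WIN32
#include <windows.h>

#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: convert a UTF-16 wide string to UTF-8 with Win32.
std::string wide_to_utf8(const std::wstring& wide) {
    if (wide.empty())
        return {};

    const int wide_len = static_cast<int>(wide.size());

    // First call: an output size of 0 asks only for the required byte count.
    const int needed = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wide_len,
                                           nullptr, 0, nullptr, nullptr);
    if (needed <= 0)
        throw std::runtime_error("WideCharToMultiByte failed (size query)");

    // Allocate the buffer, then call again to write the converted bytes.
    std::string utf8(static_cast<std::size_t>(needed), '\0');
    const int written = WideCharToMultiByte(CP_UTF8, 0, wide.data(), wide_len,
                                            utf8.data(), needed, nullptr, nullptr);
    if (written != needed)
        throw std::runtime_error("WideCharToMultiByte failed (conversion)");

    return utf8;
}
#endif  // _WIN32
```

Because an explicit length is passed instead of -1, the output is not null-terminated by the API, so the `std::string` ends up with exactly `needed` bytes. `MultiByteToWideChar` can be used the same way for the opposite direction.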
