How to properly convert UCS-2 little endian into UTF-8?

Question

I have a file whose line endings are Windows-style (\r\n); it is encoded in UCS-2 little endian.

Say this is my file fruit.txt (UCS-2 little endian):

So I open it in a std::wifstream and try to parse the contents:

// open the file
    std::wifstream file("fruit.txt");
    if( ! file.is_open() ) throw std::runtime_error(std::strerror(errno));

// create container for the lines
    std::forward_list<std::string> lines;

// Add each line to the container
    std::wstring line;
    while(std::getline(file,line)) lines.emplace_front(wstring_to_string(line));

If I try to print to cout...

// Printing to cout
    for( auto it = lines.cbegin(); it != lines.cend(); ++it )
        std::cout << *it << std::endl;

...This is what it outputs:

Cherry
Banana
ÿþApple

Worse yet, if I open it in Notepad++, this is what it looks like

I can sort of rectify this by forcibly converting the encoding back to UCS-2, which results in this:

My wstring_to_string function is defined as this:

std::string wstring_to_string( const std::wstring& wstr ) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
    return convert.to_bytes(wstr);
}

What in the world is going on here? How can I get a normal UTF-8 string? I have also tried the method from "How to read utf-16 file into utf-8 std::string line by line", but imbuing the std::wifstream first results in no output at all. Can someone please point me to the best way to convert UCS-2 LE data to readable UTF-8 data?

Edit: I think there may be a bug in mingw64/mingw-w64-x86_64-gcc 6.3.0-2, which is provided by MSYS2. I have tried everyone's suggestions, and imbuing the locale into the streams produces no output at all. I do know that only two native locales are provided, "C" and "POSIX". I was going to try Visual Studio but don't have sufficient internet speed for the 4GB download. I have used ICU as @Andrei R. suggested, and it is working great.

I would have loved to use standard libraries but I am ok with this. Please take a look at my code if you need this solution: https://pastebin.com/qudy7yva

Is this Windows? Did you get the NP++ picture by copying console text to the editor? (And before someone makes a remark that NP++ is a Windows program, it runs fine on Wine) — deviantfan, Apr 12 '17 at 03:03
Yes, this is Windows. I got the log.txt by running my program like so: ./program.exe > log.txt. I am using MSYS2's g++ 6.3.0 — Dustin Nieffenegger, Apr 12 '17 at 03:35
Well, then you should know that the Windows console (for all versions of Windows) just can't handle UTF-8. Some stuff works out of the box, for some stuff there are workarounds, but 100% correct behaviour is just impossible (e.g. because of some CRT bugs that they have no intention of fixing, since it would be too much work). The > redirection is not a part of your own program, so I wouldn't rely on it too much. — deviantfan, Apr 12 '17 at 03:56
Good information, thanks. Does that apply to the MSYS2 console though? (It comes with its own). I will try writing the file directly and let you know how it goes. — Dustin Nieffenegger, Apr 12 '17 at 04:20
Rewriting the console core and all related CRT stuff for no reason wouldn't be something Msys2 does. For supporting bash syntax etc., it's not needed, and much harder than it. — deviantfan, Apr 12 '17 at 04:44
OK I used a `std::ofstream` to output to a file, and it looked exactly like the 2nd screenshot, no difference... — Dustin Nieffenegger, Apr 12 '17 at 06:30

Remy Lebeau · Answer 1 · 2017-04-12T03:37:55.207

1

The code itself is fine as-is.

The real problem is that your input file is NOT valid UTF-16LE to begin with (your use of std::codecvt_utf8_utf16 requires UTF-16, not UCS-2). This is clearly shown in your Notepad++ screenshots.

Offhand, the file data looks like a UTF-16LE file with a BOM (ÿþ is the UTF-16LE BOM when viewed as 8bit ANSI) was appended as-is to the end of a UCS-2BE (or UTF-16BE) file that did not have a BOM.

You need to fix the input file so the entire file is valid UTF-16LE from beginning to end (with or without a BOM in front, not in the middle).

Then the code you already have will work.
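
The `ÿþ` observation can be checked programmatically: in an 8-bit ANSI code page such as Windows-1252, `ÿ` is byte 0xFF and `þ` is byte 0xFE, which together form the UTF-16LE BOM. Below is a minimal sketch, assuming the raw file bytes have already been read into a std::string. The `Bom` enum and the `detect_bom` function are hypothetical names introduced only for illustration.

```cpp
#include <cstddef>
#include <string>

// Hypothetical names for illustration; not part of any library.
enum class Bom { None, Utf8, Utf16LE, Utf16BE };

// Classifies the byte-order mark at the start of a raw byte buffer.
// Note: a UTF-32LE BOM (FF FE 00 00) also starts with FF FE; this
// sketch does not try to tell the two apart.
Bom detect_bom(const std::string& bytes) {
    auto at = [&](std::size_t i) {
        return static_cast<unsigned char>(bytes[i]);
    };
    if (bytes.size() >= 3 && at(0) == 0xEF && at(1) == 0xBB && at(2) == 0xBF)
        return Bom::Utf8;
    if (bytes.size() >= 2 && at(0) == 0xFF && at(1) == 0xFE)
        return Bom::Utf16LE;   // shows up as "ÿþ" when viewed as ANSI
    if (bytes.size() >= 2 && at(0) == 0xFE && at(1) == 0xFF)
        return Bom::Utf16BE;
    return Bom::None;
}
```

A BOM that such a check finds only at offset 0 is harmless; the same two bytes appearing in the middle of the decoded text mean they were treated as data rather than consumed as a header.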

edited Apr 12 '17 at 03:37

answered Apr 12 '17 at 03:31

Remy Lebeau


1

`The real problem is that your input file is NOT... is clearly shown in your Notepad++ screenshots`. I think the screenshots are from the output. – deviantfan Apr 12 '17 at 03:57
...and the new screenshot (input this time) looks ok. – deviantfan Apr 12 '17 at 04:04
Is this not valid UCS-2/UTF-16 LE? https://drive.google.com/file/d/0B8-ysHxtvszydlA0cFJUVXFFSEU/view?usp=sharing – Dustin Nieffenegger Apr 12 '17 at 06:48
@deviantfan good catch about the screenshots being the output, not the input. – Remy Lebeau Apr 12 '17 at 09:01
@DustinGoodson that file is perfectly fine. I can't see any possible way that input file with the code shown can produce the output shown in the screenshots. But it is clear that the `std::wifstream` is not swallowing the input BOM. You need to `imbue` a locale whose facet enables the `std::consume_header` flag. See [this answer](http://stackoverflow.com/a/26153212/65863) for an example. – Remy Lebeau Apr 12 '17 at 09:11
@RemyLebeau As I stated before, unfortunately when I imbue the stream it outputs nothing at all. Empty file. Please see my updated file to see if I did something wrong. https://pastebin.com/VbLaGFgz (p.s. I wasn't sure about the second imbue statement, however, the output is still empty without it) – Dustin Nieffenegger Apr 12 '17 at 13:21
@DustinGoodson you are using the wrong `codecvt` class for the UTF-16 to UTF-8 conversion (go back to using `std::codecvt_utf8_utf16`), and don't imbue the output `ofstream` with the "C" locale at all (if anything, you should have imbued it with a UTF-8 locale instead). Better is to just open the output file in `std::ios::binary` mode instead to let it write the `std::string` data as-is. – Remy Lebeau Apr 12 '17 at 15:21

score 0 · Accepted Answer · answered Apr 12 '17 at 02:40

0

Converting to/from Unicode is in general not so trivial. Have a look at the ICU libraries; I believe this is by far the most complete encoding conversion library for C/C++.

There are also platform-dependent ways like WideCharToMultiByte (Windows) or iconv (Linux). Or, with Qt, you can use QString::fromUtf16. You will probably have to reverse the endianness yourself.
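
Because UTF-16 and UTF-8 are both Unicode encodings, this particular conversion can also be written by hand without any of the libraries above. The following is a minimal sketch rather than a substitute for ICU. `append_utf8` and `utf16le_to_utf8` are hypothetical helper names, and the input is assumed to be the raw bytes of the file as read in binary mode.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Appends one Unicode code point to `out` as 1 to 4 UTF-8 bytes.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Converts raw UTF-16LE bytes (with an optional leading BOM) to UTF-8.
// Throws std::invalid_argument on an odd byte count or a broken
// surrogate pair.
std::string utf16le_to_utf8(const std::string& bytes) {
    if (bytes.size() % 2 != 0)
        throw std::invalid_argument("odd number of bytes");
    // Reads the little-endian 16-bit code unit starting at byte i.
    auto unit = [&](std::size_t i) -> std::uint32_t {
        std::uint32_t lo = static_cast<unsigned char>(bytes[i]);
        std::uint32_t hi = static_cast<unsigned char>(bytes[i + 1]);
        return lo | (hi << 8);
    };
    std::string out;
    std::size_t i = 0;
    if (bytes.size() >= 2 && unit(0) == 0xFEFF) i = 2;  // skip the BOM
    for (; i < bytes.size(); i += 2) {
        std::uint32_t cp = unit(i);
        if (cp >= 0xD800 && cp <= 0xDBFF) {             // high surrogate
            if (i + 3 >= bytes.size())
                throw std::invalid_argument("truncated surrogate pair");
            std::uint32_t low = unit(i + 2);
            if (low < 0xDC00 || low > 0xDFFF)
                throw std::invalid_argument("invalid low surrogate");
            cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
            i += 2;                                     // consumed two units
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            throw std::invalid_argument("unpaired low surrogate");
        }
        append_utf8(out, cp);
    }
    return out;
}
```

Reading the two bytes of each code unit explicitly in little-endian order means the code behaves the same regardless of the host's endianness, so no separate byte-swapping step is needed.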

answered Apr 12 '17 at 02:40

Andrei R.


1

`converting to/from unicode is in general not so trivial.` This is a conversion from Unicode to Unicode... that is manageable without ICU – deviantfan Apr 12 '17 at 02:56

score 0 · Answer 3 · edited May 23 '17 at 11:46

In your case, the main issue is that the wifstream reads the file in the wrong way. If you print the size of wstr in wstring_to_string, you will find that it is not what you expect.

https://stackoverflow.com/a/19698449/4005852

Setting the proper locale will fix this issue.

#include <cerrno>
#include <codecvt>
#include <cstring>
#include <forward_list>
#include <fstream>
#include <iostream>
#include <locale>
#include <stdexcept>
#include <string>

std::string wstring_to_string( const std::wstring& wstr ) {
    std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> convert;
    return convert.to_bytes(wstr);
}

int main()
{
    // open the file in binary mode and decode it as UTF-16LE
    std::wifstream file("fruit.txt", std::ios::binary);
    file.imbue(std::locale(file.getloc(),
          new std::codecvt_utf16<wchar_t, 0x10ffff, std::little_endian>));
    if( ! file.is_open() ) throw std::runtime_error(std::strerror(errno));

    // create container for the lines
    std::forward_list<std::string> lines;

    // Add each line to the container
    std::wstring line;
    file.get(); // remove BOM
    while(std::getline(file,line)) {
        // binary mode keeps the \r of each \r\n pair, so drop it
        if( ! line.empty() && line.back() == L'\r' ) line.pop_back();
        lines.emplace_front(wstring_to_string(line));
    }

    // Printing to cout
    for( auto it = lines.cbegin(); it != lines.cend(); ++it )
        std::cout << *it << std::endl;

    return 0;
}
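
If imbuing a wide stream misbehaves on a particular toolchain, one possible alternative is to avoid std::wifstream altogether: read the raw bytes in binary mode, assemble the 16-bit code units manually, and convert them with std::wstring_convert over char16_t. This is a sketch under assumptions, not a tested fix for any specific compiler. `read_utf16le_lines` is a hypothetical helper name, and the `<codecvt>` facilities it relies on are deprecated since C++17.

```cpp
#include <codecvt>   // deprecated since C++17
#include <cstddef>
#include <fstream>
#include <iterator>
#include <locale>
#include <string>
#include <vector>

// Hypothetical helper: reads a UTF-16LE file and returns its lines as
// UTF-8 strings, in file order. Returns an empty vector if the file
// cannot be opened; to_bytes throws std::range_error on invalid UTF-16.
std::vector<std::string> read_utf16le_lines(const std::string& path) {
    std::ifstream file(path, std::ios::binary);  // raw bytes, no translation
    std::string bytes((std::istreambuf_iterator<char>(file)),
                      std::istreambuf_iterator<char>());

    // Assemble little-endian 16-bit code units from byte pairs.
    std::u16string text;
    for (std::size_t i = 0; i + 1 < bytes.size(); i += 2) {
        unsigned lo = static_cast<unsigned char>(bytes[i]);
        unsigned hi = static_cast<unsigned char>(bytes[i + 1]);
        text += static_cast<char16_t>(lo | (hi << 8));
    }
    if (!text.empty() && text.front() == 0xFEFF) text.erase(0, 1);  // drop BOM

    std::wstring_convert<std::codecvt_utf8_utf16<char16_t>, char16_t> convert;
    std::vector<std::string> lines;
    std::size_t start = 0;
    while (start < text.size()) {
        std::size_t end = text.find(u'\n', start);
        if (end == std::u16string::npos) end = text.size();
        std::size_t len = end - start;
        if (len > 0 && text[end - 1] == u'\r') --len;  // strip CR of CRLF
        lines.push_back(convert.to_bytes(text.substr(start, len)));
        start = end + 1;
    }
    return lines;
}
```

The resulting strings can then be written through a std::ofstream opened with std::ios::binary, so the UTF-8 bytes reach the output file unchanged.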

I got no output at all from that. I am beginning to think this is a compiler bug :/ — Dustin Nieffenegger, Apr 13 '17 at 17:14
I'm using "Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24210 for x64". What's your compiler? — Cong Ma, Apr 13 '17 at 17:20
Okay. I am currently downloading Visual Studio to try with another compiler. I usually use g++ from MSYS2 — Dustin Nieffenegger, Apr 13 '17 at 17:21
