8

I've been using StackOverflow since the beginning, and have on occasion been tempted to post questions, but I've always either figured them out myself or found answers posted eventually... until now. This feels like it should be fairly simple, but I've been wandering around the internet for hours with no success, so I turn here:

I have a pretty standard UTF-16 text file, with a mixture of English and Chinese characters. I would like those characters to end up in a string (technically, a wstring). I've seen a lot of related questions answered (here and elsewhere), but they're either looking to solve the much harder problem of reading arbitrary files without knowing the encoding, or converting between encodings, or are just generally confused about "Unicode" being a range of encodings. I know the source of the text file I'm trying to read; it will always be UTF-16, it has a BOM and everything, and it can stay that way.

I had been using the solution described here, which worked for text files that were all English, but after encountering certain characters, it stopped reading the file. The only other suggestion I found was to use ICU, which would probably work, but I'd really rather not include a whole large library in an application for distribution, just to read one text file in one place. I don't care about system independence, though - I only need it to compile and work in Windows. A solution that didn't rely on that fact would be prettier, of course, but I would be just as happy with a solution that used the STL while relying on assumptions about Windows architecture, or even solutions that involved Win32 functions, or ATL; I just don't want to have to include another large 3rd-party library like ICU. Am I still totally out of luck unless I want to reimplement it all myself?

edit: I'm stuck using VS2008 for this particular project, so C++11 code sadly won't help.

edit 2: I realized that the code I had been borrowing before didn't fail on non-English characters like I thought it was doing. Rather, it fails on specific characters in my test document, among them ':' (FULLWIDTH COLON, U+FF1A) and ')' (FULLWIDTH RIGHT PARENTHESIS, U+FF09). bames53's posted solution also mostly works, but is stumped by those same characters?

edit 3 (and the answer!): the original code I had been using -did- mostly work - as bames53 helped me discover, the ifstream just needed to be opened in binary mode for it to work.

neminem
  • Please show us some code. What actual API are you calling? ReadFile? fread? read? – bmargulies May 08 '12 at 18:19
  • There shouldn't be a problem if you're actually certain that the text is UTF16. To the best of my knowledge, Chinese typically ends up as an MBCS string, which is an entirely different beast. – Mahmoud Al-Qudsi May 08 '12 at 18:25
  • `_wfopen` can open/translate UTF-16 which can then be read into a string by `fread` http://msdn.microsoft.com/fr-fr/library/yeby3zcb%28v=vs.80%29.aspx – Benj May 08 '12 at 18:25
  • I don't see any reason why the code you linked to shouldn't work. It reads a file of bytes and type-casts it to `wchar_t*` to initialize a `wstring`. The only thing I'd check is if the file is opened in binary mode, but I wouldn't expect a mistake there to show your symptom. – Mark Ransom May 08 '12 at 20:53
  • @MarkRansom See my response to bames53's post: I now have a better idea just -what- odd symptom it is that that code we had previously been using was displaying: certain specific unicode characters stopped it reading before it had read the whole file. Not enough of a unicode expert to guess -why-, though. – neminem May 08 '12 at 21:20
  • @bmargulies (and whoever voted that comment up): I linked to the code I had previously been using, which was stl (ifstream/stringstream). I'm -not- tied to a particular API, though, long as it's one I have access to. – neminem May 08 '12 at 21:33
  • @MahmoudAl-Qudsi: I'm pretty darn sure it's UTF16. Or at least, I'm pretty sure it's a text file that looks a lot like UTF16, and is definitely not MBCS. I can't prove, for instance, that it isn't actually UCS-2 (I knew nothing about that encoding or its differences from UTF16 until today.) – neminem May 08 '12 at 21:35

3 Answers

11

The C++11 solution (supported on your platform by Visual Studio since 2010, as far as I know) would be:

#include <fstream>
#include <iostream>
#include <locale>
#include <codecvt>
int main()
{
    // open as a byte stream
    std::wifstream fin("text.txt", std::ios::binary);
    // apply BOM-sensitive UTF-16 facet
    fin.imbue(std::locale(fin.getloc(),
       new std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>));
    // read     
    for(wchar_t c; fin.get(c); )
            std::cout << std::showbase << std::hex << c << '\n';
}
Cubbi
  • On platforms with a two byte `wchar_t` like Windows this will convert from UTF-16 to UCS-2. Specifically the VS2010 implementation truncates characters outside the BMP. – bames53 May 08 '12 at 19:12
  • @bames53 Indeed.. VS2010 reads those characters into `char32_t` correctly, but there's not a lot that can be done with a UCS4 string on Windows. It's probably too early to get rid of compiler-dependent stuff like `_O_U16TEXT`. – Cubbi May 08 '12 at 19:26
  • Annoyingly, I tried your snippet, and at first I thought it wasn't working (when I saw it print integers rather than Unicode characters), but then I noticed that was what it was supposed to be doing. I replaced the cout with appending to a wstring, and saw the unicode string I was expecting to see. I say "annoyingly" because I hadn't thought it was important to mention I'm stuck at vs2008 for this particular project, until now. (I have so edited my question.) This is still a correct answer, though, assuming you're allowed to use C++11. Or barring characters outside the BMP it is, anyway. – neminem May 08 '12 at 20:38
  • do you know how writing back to file goes? I try: `std::wofstream wofs("/utf16dump.txt"); wofs.imbue(std::locale(wofs.getloc(), new std::codecvt_utf16)); wofs << ws;` and I get garbage – NoSenseEtAl Jun 08 '12 at 15:18
  • @NoSenseEtAl works for me, produces UTF-16be, as requested (using clang++/libcxx). Perhaps you needed `std::little_endian`? – Cubbi Jun 08 '12 at 15:56 (a sketch using explicit byte-order flags follows these comments)
  • `std::consume_header` doesn't seem to work in VS2010 -- BOM is consumed, but byte order is not affected. I had to explicitly use `std::little_endian` too. – Eugene May 07 '13 at 23:14
  • Why do you open the file in the binary mode? – hkBattousai Apr 01 '16 at 09:21
  • @hkBattousai because I don't want the read to terminate if it runs into `\x1a`. Windows is crazy like that. – Cubbi Apr 01 '16 at 10:25
  • For readers, replace the last line with `std::wcout << c << '\n';` to see Unicode characters output. – zar Oct 28 '19 at 19:33
  • Note that on macOS I had to explicitly set `std::little_endian` instead of `std::consume_header` for a file encoded as UTF-16 LE that included the respective BOM. Otherwise I would receive big endian output. – bfx May 27 '20 at 10:24
  • MSVC's version says this use of std::codecvt is deprecated in C++17, see `_CXX17_DEPRECATE_CODECVT_HEADER`. I don't see this mentioned here: https://en.cppreference.com/w/cpp/locale/codecvt – Chris Guzak Jun 30 '21 at 04:20
  • @ChrisGuzak `std::codecvt` was not deprecated. The `codecvt` header and its contents were - cppreference notes that on https://en.cppreference.com/w/cpp/locale#Locale-independent_unicode_conversion_facets and individual pages – Cubbi Jul 16 '21 at 17:43
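
Following up on the write-back question and the `std::little_endian` remarks in the comments above, here is a minimal C++11 sketch (not from the answer; the file name and placeholder string are arbitrary, and a 2-byte `wchar_t` as on Windows is assumed) that requests both the BOM and little-endian byte order explicitly:

#include <fstream>
#include <locale>
#include <codecvt>
#include <string>

int main()
{
    std::wstring ws = L"\xff1a\xff09 example text"; // placeholder contents
    // binary mode, so newline translation cannot mangle the byte stream
    std::wofstream wofs("utf16dump.txt", std::ios::binary);
    // generate_header writes a BOM; little_endian forces UTF-16LE output
    // (non-BMP characters are still subject to the UCS-2 caveat noted above)
    wofs.imbue(std::locale(wofs.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10ffff,
            std::codecvt_mode(std::generate_header | std::little_endian)>));
    wofs << ws;
}
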
9

When you open a file for UTF-16, you must open it in binary mode. This is because in text mode, certain characters are interpreted specially - specifically, 0x0d is filtered out completely and 0x1a marks the end of the file. There are some UTF-16 characters that will have one of those bytes as half of the character code and will mess up the reading of the file. (The FULLWIDTH COLON U+FF1A from the question is one of them: in UTF-16LE it is stored as the bytes 0x1a 0xff, so the 0x1a byte looks like end-of-file.) This is not a bug; it is intentional behavior, and it is the sole reason for having separate text and binary modes.

For the reason why 0x1a is considered the end of a file, see this blog post from Raymond Chen tracing the history of Ctrl-Z. It's basically backwards compatibility run amok.
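
A tiny demonstration sketch of that behavior (not part of the answer; the file name is arbitrary), writing a 0x1a byte and then reading it back in text mode versus binary mode:

#include <fstream>
#include <iostream>
#include <iterator>
#include <string>

int main()
{
    // write three bytes: 'a', 0x1a (Ctrl-Z), 'b'
    {
        std::ofstream out("ctrlz.bin", std::ios::binary);
        out.write("a\x1a" "b", 3);
    }
    std::ifstream text_in("ctrlz.bin");                  // text mode
    std::ifstream bin_in("ctrlz.bin", std::ios::binary); // binary mode
    std::string t((std::istreambuf_iterator<char>(text_in)),
                  std::istreambuf_iterator<char>());
    std::string b((std::istreambuf_iterator<char>(bin_in)),
                  std::istreambuf_iterator<char>());
    // on Windows this typically reports 1 byte in text mode and 3 in binary mode
    std::cout << "text mode: "   << t.size() << " bytes, "
              << "binary mode: " << b.size() << " bytes\n";
}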

Mark Ransom
5

Edit:

So it appears that the issue was that Windows treats certain magic bytes as the end of the file in text mode. This is solved by using binary mode to read the file, `std::ifstream fin("filename", std::ios::binary);`, and then copying the data into a wstring as you already do.



The simplest, non-portable solution would be to just copy the file data into a wchar_t array. This relies on the fact that wchar_t on Windows is 2 bytes and uses UTF-16 as its encoding.


You'll have a bit of difficulty converting UTF-16 to the locale specific wchar_t encoding in a completely portable fashion.

Here's the Unicode conversion functionality available in the standard C++ library (though VS 10 and 11 implement only items 3, 4, and 5):

  1. `codecvt<char32_t,char,mbstate_t>`
  2. `codecvt<char16_t,char,mbstate_t>`
  3. `codecvt_utf8`
  4. `codecvt_utf16`
  5. `codecvt_utf8_utf16`
  6. `c32rtomb`/`mbrtoc32`
  7. `c16rtomb`/`mbrtoc16`

And what each one does:

  1. A codecvt facet that always converts between UTF-8 and UTF-32
  2. converts between UTF-8 and UTF-16
  3. converts between UTF-8 and UCS-2 or UCS-4, depending on the size of the target element (characters outside the BMP are probably truncated)
  4. converts between a sequence of chars using a UTF-16 encoding scheme and UCS-2 or UCS-4
  5. converts between UTF-8 and UTF-16
  6. If the macro `__STDC_UTF_32__` is defined, these functions convert between the current locale's char encoding and UTF-32
  7. If the macro `__STDC_UTF_16__` is defined, these functions convert between the current locale's char encoding and UTF-16

If `__STDC_ISO_10646__` is defined then converting directly using `codecvt_utf16<wchar_t>` should be fine since that macro indicates that wchar_t values in all locales correspond to the short names of Unicode characters (and so implies that wchar_t is large enough to hold any such value).
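
On Windows specifically, that route could look like the following C++11 sketch (so not applicable to VS2008; the function name and the use of `std::consume_header` are illustrative) - read the raw bytes, then convert them with `std::wstring_convert`:

#include <fstream>
#include <sstream>
#include <string>
#include <locale>
#include <codecvt>

std::wstring read_utf16_file(const char* path)
{
    // read the raw bytes; binary mode avoids the text-mode issue mentioned in the edit above
    std::ifstream fin(path, std::ios::binary);
    std::stringstream ss;
    ss << fin.rdbuf();
    // consume_header detects and skips the BOM; with a 2-byte wchar_t the
    // result is effectively UCS-2, so non-BMP characters are not preserved
    std::wstring_convert<
        std::codecvt_utf16<wchar_t, 0x10ffff, std::consume_header>,
        wchar_t> conv;
    return conv.from_bytes(ss.str());
}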

Unfortunately there's nothing defined that goes directly from UTF-16 to wchar_t. It's possible to go UTF-16 -> UCS-4 -> mb (if `__STDC_UTF_32__`) -> wc, but you'll lose anything that's not representable in the locale's multi-byte encoding. And of course, no matter what, converting from UTF-16 to wchar_t will lose anything not representable in the locale's wchar_t encoding.


So it's probably not worth being portable, and instead you can just read the data into a wchar_t array, or use some other Windows-specific facility, such as the `_O_U16TEXT` mode on files.
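
For the latter route, a Windows-only sketch (not from the answer; the file name and buffer size are arbitrary) using the CRT's Unicode stream support, which the `_wfopen` comment under the question also points at, might look like:

#include <stdio.h>
#include <wchar.h>
#include <string>

int main()
{
    std::wstring ws;
    // ccs=UTF-16LE asks the CRT to decode the file (honoring its BOM);
    // the flag is documented for the VS2005-era CRT onward, so it should be usable from VS2008
    FILE* f = _wfopen(L"text.txt", L"rt, ccs=UTF-16LE");
    if (!f)
        return 1;
    wchar_t buf[512];
    while (fgetws(buf, 512, f))
        ws += buf;  // fgetws hands back wide characters directly
    fclose(f);
}
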

This should build and run anywhere, but makes a bunch of assumptions to actually work:

#include <fstream>
#include <sstream>
#include <iostream>
#include <string>
#include <cstring>

int main ()
{
    std::stringstream ss;
    std::ifstream fin("filename", std::ios::binary); // binary mode, per the edit above
    ss << fin.rdbuf(); // dump file contents into a stringstream
    std::string const &s = ss.str();
    if (s.size()%sizeof(wchar_t) != 0)
    {
        std::cerr << "file not the right size\n"; // must be even, two bytes per code unit
        return 1;
    }
    std::wstring ws;
    ws.resize(s.size()/sizeof(wchar_t));
    std::memcpy(&ws[0],s.c_str(),s.size()); // copy data into wstring
}

You should probably at least add code to handle endianness and the 'BOM'. Also, Windows newlines don't get converted automatically, so you need to do that manually.
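
A sketch of that BOM/endianness handling (not part of the original answer; it assumes a 2-byte `wchar_t` as on Windows and a hypothetical helper name) might look like:

#include <string>
#include <cstddef>

// convert raw UTF-16 bytes (with an optional BOM) into a wstring,
// assuming wchar_t is 2 bytes as it is on Windows
std::wstring utf16_bytes_to_wstring(const std::string& s)
{
    std::size_t start = 0;
    bool big_endian = false;
    if (s.size() >= 2)
    {
        unsigned char b0 = s[0], b1 = s[1];
        if (b0 == 0xFF && b1 == 0xFE)      { start = 2; }                    // UTF-16LE BOM
        else if (b0 == 0xFE && b1 == 0xFF) { start = 2; big_endian = true; } // UTF-16BE BOM
    }
    std::wstring ws;
    for (std::size_t i = start; i + 1 < s.size(); i += 2)
    {
        unsigned char lo = s[i], hi = s[i + 1];
        if (big_endian) { unsigned char tmp = lo; lo = hi; hi = tmp; }
        // assemble one 16-bit code unit; surrogate pairs pass through unchanged
        ws.push_back(static_cast<wchar_t>(lo | (hi << 8)));
    }
    return ws;
}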

bames53
  • Well, turns out, your code helped me debug - it stopped reading in exactly the same place in my sample text file as the code I linked to (http://cfc.kizzx2.com/index.php/reading-a-unicode-utf16-file-in-windows-c) did. Turns out it wasn't stopping at a Chinese character, it stopped reading at the first instance of a : (FULLWIDTH COLON, U+FF1A) character. Removing that, it then stops at ) (FULLWIDTH RIGHT PARENTHESIS, U+FF09). I'm sensing a theme... – neminem May 08 '12 at 21:17
  • @neminem I guess I should have looked more closely at that link, it's just doing the same thing as I show. I'm guessing that for whatever reason, the VS 2008 implementation of fstream does not like reading the byte 0xFF. That byte represents 'delete'. Try opening the file in binary mode `std::ifstream fin("...",std::ios::binary);` – bames53 May 08 '12 at 21:32
  • Oh my frelling god. I spent over a day trying to figure it out, and it was that obvious? I tried -other- things that involved opening the file in binary mode, but I never tried the -original- solution only opening it in binary mode? You win so much. You should edit that into your solution, in case other people stumble on this question later (I can't imagine I'm the only person who's ever had this issue) :). – neminem May 08 '12 at 21:39
  • It's not a bug - see my answer. – Mark Ransom May 09 '12 at 13:38
  • @MarkRansom That makes sense, though I'd have expected it to only have an effect on Windows when 0x0D and 0x0A appear together. The 0x1A seems like a bug by design, but since none of this stuff is standardized it's probably best to never use text mode anywhere. – bames53 May 09 '12 at 14:25