14

I am currently writing an application which requires me to call GetWindowText on arbitrary windows and store that data to a file for later processing. Long story short, I noticed that my tool was failing on Battlefield 3, and I narrowed the problem down to the following character in its window title, U+2122 (TRADE MARK SIGN): http://www.fileformat.info/info/unicode/char/2122/index.htm

So I created a little test app which just does the following:

std::wcout << L"\u2122";

Lo and behold, that breaks output to the console window for the remainder of the program.

Why is the MSVC STL choking on this character (and I assume others) when APIs like MessageBoxW etc display it just fine?

How can I get those characters printed to my file?

Tested on both VC10 and VC11 under Windows 7 x64.

Sorry for the poorly constructed post, I'm tearing my hair out here.

Thanks.

EDIT:

Minimal test case

#include <fstream>
#include <iostream>

int main()
{
  {
    std::wofstream test_file("test.txt");
    test_file << L"\u2122";
  }

  std::wcout << L"\u2122";
}

Expected result: '™' character printed to console and file. Observed result: File is created but is empty. No output to console.

I have confirmed that the font I'm using for my console is capable of displaying the character in question, and the file is definitely empty (0 bytes in size).

EDIT:

Further debugging shows that the 'failbit' and 'badbit' are set in the stream(s).

EDIT:

I have also tried using Boost.Locale and I am having the same issue even with the new locale imbued globally and explicitly to all standard streams.

ST3
RaptorFactor

4 Answers

22

To write to a file, you have to give the stream a locale whose conversion facet produces the encoding you want. For example, if you want to write the characters as UTF-8, you have to add

const std::locale utf8_locale
            = std::locale(std::locale(), new std::codecvt_utf8<wchar_t>());
test_file.imbue(utf8_locale);

You have to add these 2 include files

#include <codecvt>
#include <locale>

To write to the console you have to put the console into the correct mode (this is Windows-specific) by adding

_setmode(_fileno(stdout), _O_U8TEXT);

(in case you want to use UTF-8).

For this you have to add these 2 include files:

#include <fcntl.h>
#include <io.h>

Furthermore, you have to make sure that you are using a font that supports Unicode (for example, Lucida Console). You can change the font in the properties of your console window.

The complete program now looks like this:

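#include <fstream>
#include <iostream>
#include <codecvt>
#include <locale>
#include <cstdio>   // _fileno, stdout
#include <fcntl.h>  // _O_U8TEXT
#include <io.h>     // _setmode

int main()
{
  // Locale that converts wchar_t to UTF-8; the locale owns the facet.
  const std::locale utf8_locale = std::locale(std::locale(),
                                    new std::codecvt_utf8<wchar_t>());
  {
    std::wofstream test_file("c:\\temp\\test.txt");
    test_file.imbue(utf8_locale);
    test_file << L"\u2122";
  }

  // Windows-specific: switch stdout to UTF-8 text mode before using wcout.
  _setmode(_fileno(stdout), _O_U8TEXT);
  std::wcout << L"\u2122";
}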
BertR
  • 1
    Well I'll be damned, imbuing that UTF8 locale actually worked... Now why the hell isn't Boost.Locale doing that for me? I interpreted the docs as saying that UTF-8 is assumed to be the default narrow encoding, and I've imbued the locale globally and to all static streams, so what the hell... – RaptorFactor Mar 26 '12 at 12:22
3

Are you always using std::wcout or are you sometimes using std::cout? Mixing these won't work. Of course, the error description "choking" doesn't say at all what problem you are observing. I'd suspect that this is a different problem to the one using files, however.

As there is no real description of the problem it takes somewhat of a crystal ball followed by a shot in the dark to hit the problem... Since you want to get Unicode characters into your file, make sure that the file stream you are using uses a std::locale whose std::codecvt<...> facet actually converts to a suitable Unicode encoding.

Dietmar Kühl
  • I am always using wide types and apis. Even something as simple as the line I posted in my question fails on my platform. Ditto if you replace wcout with a wofstream. – RaptorFactor Mar 25 '12 at 12:18
  • I have added a minimal test case. – RaptorFactor Mar 25 '12 at 12:23
  • Did you verify that the `std::codecvt` used by the default `std::locale` uses a Unicode-aware encoding? Boost seems to have a [UTF-8 facet](http://www.boost.org/doc/libs/1_49_0/libs/serialization/doc/codecvt.html). I'd suspect that the `std::wcout` on your platform uses a `std::basic_filebuf`, i.e. it would work for both files and console output. – Dietmar Kühl Mar 25 '12 at 15:00
2

I just tested GCC (versions 4.4 through 4.7) and MSVC 10, which all exhibit this problem.

Equally broken is wprintf, which does as little as the C++ stream API.

I also tested the raw Win32 API to check whether something else was causing the failure, and this works:

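#include <windows.h>
int main()
{
    // Named "out" rather than "stdout", which the C library may define as a macro.
    HANDLE out = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD n;
    // The third argument is the number of characters to write.
    WriteConsoleW( out, L"\u03B2", 1, &n, NULL );
}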

Which writes β to the console (if you set cmd's font to something like Lucida Console).

Conclusion: wchar_t output is horribly broken in both large C++ Standard library implementations.

rubenvb
  • 2
    It's not horribly broken, just horribly documented. – Mark Ransom Mar 26 '12 at 01:24
  • What would you say my options are? A rewrite to use the raw API would involve thousands of lines of code. Boost.Locale didn't seem to solve the problem either... – RaptorFactor Mar 26 '12 at 03:00
  • I don't have Nicolai Josuttis' [`The C++ Standard Library`](http://www.josuttis.com/libbook/) handy, but it's the definitive book on the subject. And considering that the IOStreams bit is co-written by Dietmar Kühl ;) , it does cover the whole character conversion stuff in IOStream quite well. – MSalters Mar 26 '12 at 09:37
1

Although the wide character streams take Unicode as input, that's not what they produce as output - the characters go through a conversion. If a character can't be represented in the encoding that it's converting to, the output fails.

Mark Ransom
  • That seems so 'wrong' (for lack of a better word). I'm not sure I understand how to actually work around/fix what you're saying though... – RaptorFactor Mar 26 '12 at 07:15
  • I don't think it's true, either. `std::wstringstream` definitely is a wide character stream (inherits from `std::wstream`), but doesn't do any conversion. – MSalters Mar 26 '12 at 09:40