This smelled a lot like a text encoding problem, so I tried running the command you provided, and sure enough, the output file is encoded in UTF-16 LE. (That's 16-bit code units, little-endian.) Try opening the file in a hex editor to see what it actually looks like.
You were on the right path when trying to use wide strings, but dealing with Unicode can be tricky. The next few paragraphs will give you some tips on how to deal with this the hard way, but if you need a quick and easy solution, jump to the end.
There are two things to be careful of. First, make sure you're also using the wide streams, like wcout. It's worth casting each character to an int to double-check that there isn't a problem with the output formatting.
Second, the width of wchar_t (and therefore of wcout, wstring, etc.) is not fixed by the standard. On some compilers it's 2 bytes per character, and on others it's 4. You can sometimes change this in your compiler settings. C++11 also provides std::u16string and std::u32string, which are explicit about their element size.
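If you're curious which you've got, a quick check like this will tell you (just a sketch; the wchar_t result is the one that varies between toolchains):

#include <iostream>

int main() {
    std::cout << "wchar_t:  " << sizeof(wchar_t)  << " bytes\n";  // commonly 2 on MSVC, 4 on GCC/Clang
    std::cout << "char16_t: " << sizeof(char16_t) << " bytes\n";  // elements of std::u16string
    std::cout << "char32_t: " << sizeof(char32_t) << " bytes\n";  // elements of std::u32string
    return 0;
}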
Reading Unicode text can unfortunately be quite a hassle with the C++ standard library, because even if you have the right string size, you still need to deal with BOMs and endianness, not to mention canonicalization.
There are libraries to help with this, but the simplest solution might just be to open the .txt file in Notepad, choose Save As, then pick an encoding you're more comfortable with, like ANSI.
Edit: If you're not happy with the quick and dirty solution, and you don't want to use a better Unicode library, you can do this with the standard library, but only if you're using a compiler that supports C++11, such as Visual Studio 2012.
C++11 added some codecvt facets to handle converting between different Unicode encodings. This should suit your purpose, but this part of the library was designed in the days of yore and can be rather difficult to understand. Hold on to your pants.
Below the line where you open your ifstream, add this code:
infoFile.imbue(std::locale(infoFile.getloc(), new std::codecvt_utf16<char, 0x10FFFF, std::consume_header>));
I know that looks a bit scary. What it's doing is making a "locale" from a copy of the existing locale, then adding a "facet" to the locale which handles the format conversion.
"Locales" handle a whole bunch of stuff, mostly related to localization (such as how to punctuate currency, eg "100.00" vs "100,00"). Each of the rules in the locale is called a facet. In the C++ standard library, file encoding is treated as one of these facets.
(Background: In retrospect, it probably wasn't a very wise idea to mix file encoding up with localization, but at the time this part of the library was designed, file encoding was typically dictated by the language of the program, so that's how we got into this situation.)
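To make the facet idea a bit more concrete, here's a tiny sketch that has nothing to do with encodings: imbuing a stream with a different locale swaps in different numeric-formatting facets. (The locale name "de_DE.UTF-8" is just an example and may not exist on your system, hence the try/catch.)

#include <iostream>
#include <locale>

int main() {
    std::cout << 1234.5 << "\n";                      // "1234.5" with the default "C" locale
    try {
        std::cout.imbue(std::locale("de_DE.UTF-8"));  // swap in German formatting facets
        std::cout << 1234.5 << "\n";                  // "1.234,5" if that locale is installed
    } catch (const std::runtime_error&) {
        // That locale name isn't available on this system; nothing to demonstrate.
    }
    return 0;
}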
So the locale constructor in the imbue line above takes a copy of the default locale created by the file stream as its first parameter, and the second parameter is the new facet to use.
codecvt_utf16 is a facet for converting to and from UTF-16. The first template parameter is the "wide" type, which is to say, the type used by the program rather than the type used in the byte stream. I specified char here, and that works with Visual Studio, but it's not actually valid according to the standard. I'll get to that later.
The second template parameter is the maximum Unicode value you want to accept without throwing an error, and for the foreseeable future, 0x10FFFF is the largest Unicode code point.
The final template parameter is a bitmask that changes the behaviour of the facet. I thought std::consume_header would be particularly useful for you, since wmic outputs a BOM (at least on my machine). That flag consumes the BOM and chooses whether to treat the stream as little- or big-endian based on what it finds.
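As an aside, if your file turned out to have no BOM but you knew it was little-endian, you could say so explicitly by OR-ing in another mode flag. This is just a sketch of a variation on the line above, and the same caveat about using char applies:

infoFile.imbue(std::locale(infoFile.getloc(),
    new std::codecvt_utf16<char, 0x10FFFF,
        std::codecvt_mode(std::consume_header | std::little_endian)>));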
You'll also notice that I'm creating the facet with new (so on the heap, not the stack), but I'm not calling delete anywhere. This is not a very safe way to design a library in modern C++, but like I said, locales are a rather old part of the library.
Rest assured that you don't need to delete this facet. This isn't documented very well (since locales are so rarely used in practice), but a default-constructed facet will be automatically deleted by the locale it's attached to.
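If you're curious, the switch is the refs argument to the facet's constructor: it defaults to 0, which hands ownership to the locale, while a nonzero value tells the locale to leave the facet alone. A small sketch of the difference, reusing the same infoFile stream as above:

// refs defaults to 0: the locale(s) using the facet delete it when they're done.
std::locale managed(infoFile.getloc(),
    new std::codecvt_utf16<char, 0x10FFFF, std::consume_header>);
// refs != 0: locales never delete the facet, so it must outlive every locale that uses it.
static std::codecvt_utf16<char, 0x10FFFF, std::consume_header> longLived(1);
std::locale borrowed(infoFile.getloc(), &longLived);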
Now, remember how I said it's not valid to use char as the wide type? The standard says you have to use wchar_t, char16_t or char32_t, and if you want to support non-ASCII characters, you'll definitely want to do this. The easiest way to make this valid would be to use wchar_t: change ifstream, string, cout, and istringstream to wifstream, wstring, wcout, and wistringstream, then make sure your strings/char constants have an L in front of them, like so:
std::wcout << L"\nLine #" << lineNum << L":" << line << std::endl;
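For reference, here's a minimal end-to-end sketch of the whole thing. I'm assuming the wmic output was redirected to a file called "info.txt", and I'm reusing the infoFile, lineNum and line names from the snippets above; adjust both to match your actual code.

#include <codecvt>
#include <fstream>
#include <iostream>
#include <locale>
#include <string>

int main() {
    std::wifstream infoFile("info.txt");
    // Imbue before the first read so the facet can consume the BOM and pick the byte order.
    infoFile.imbue(std::locale(infoFile.getloc(),
        new std::codecvt_utf16<wchar_t, 0x10FFFF, std::consume_header>));

    std::wstring line;
    int lineNum = 0;
    while (std::getline(infoFile, line)) {
        ++lineNum;
        std::wcout << L"\nLine #" << lineNum << L":" << line << std::endl;
    }
    return 0;
}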
Those are all the changes you need in order to use wide strings. However, beware that the Windows console can't display characters outside the current code page, so if you try to output such a character (when I ran the code I hit a ™ character), the wcout stream's error state gets set and it stops outputting anything. If you're outputting to a file instead, this shouldn't be a problem.
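If that does bite you, one workaround is to clear the stream's error state and carry on; the character that failed is lost, but later output goes through again:

if (!std::wcout) {
    std::wcout.clear();  // reset the failed state left behind by the unconvertible character
}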
You can probably tell that I'm not particularly thrilled about this part of the standard library. In practice, most people who want to use Unicode will use a different library (like the ones I mentioned in the comments), or roll their own encoders/decoders.