I'm just beginning to learn about encoding issues, and I've learned just enough to know that it's far more complicated (on Windows, at least) than I had imagined, and that I have a lot more to learn.
I have an XML document that I believe is encoded as UTF-8. I'm using a VB.NET app to transform the XML (with XslCompiledTransform and XmlTextWriter) into a column-specific text file. Some characters in the XML come out wrong in the output text file. Example: an em-dash (—) is turned into the three characters "—". When that happens, the columns in the file are thrown off.
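To make the symptom concrete, here is a minimal sketch in Python (my choice for illustration only; it is not part of the VB.NET app). It shows that the three UTF-8 bytes of an em-dash, if read back as Windows-1252, give exactly the three characters I'm seeing. Whether that is what is actually happening in my case is only a guess.

```python
# U+2014 EM DASH is encoded as three bytes in UTF-8.
em_dash = "\u2014"
utf8_bytes = em_dash.encode("utf-8")
print(utf8_bytes.hex(" "))  # e2 80 94

# Reading those same three bytes as Windows-1252 (cp1252) yields
# three separate characters instead of one.
misread = utf8_bytes.decode("cp1252")
print(len(misread))                    # 3
print([hex(ord(c)) for c in misread])  # ['0xe2', '0x20ac', '0x201d']
```

If this is the mechanism, the bytes on disk could be fine and only the interpretation wrong, which is part of what I'm asking below.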
As I understand it, an em-dash is not even a "Unicode character". I wouldn't expect to have issues with it. But, I can make the problem come and go by changing the encoding specified in the VB.net app.
If I use this, the em-dash is preserved:
Using writer = New XmlTextWriter(strOutputPath, New UTF8Encoding(True))
If I use this, the em-dash is corrupted into "—":
Using writer = New XmlTextWriter(strOutputPath, New UTF8Encoding(False))
The True/False argument simply tells UTF8Encoding whether to write the BOM (byte order mark) at the beginning of the file. As I understand it, the BOM is neither necessary nor recommended for UTF-8. So, my preference is False - but then I get the weird characters.
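For reference, this is what I would expect the two settings to differ by, sketched in Python rather than VB.NET (an assumption for illustration; Python's "utf-8-sig" codec stands in for UTF8Encoding(True)). It does not prove what my app writes, only what a BOM looks like in bytes.

```python
import codecs

text = "a\u2014b"  # an em-dash between two ASCII letters

with_bom = text.encode("utf-8-sig")  # UTF-8 preceded by the BOM
without_bom = text.encode("utf-8")   # plain UTF-8, no BOM

print(with_bom.hex(" "))     # ef bb bf 61 e2 80 94 62
print(without_bom.hex(" "))  # 61 e2 80 94 62

# The only difference is the three leading BOM bytes.
print(with_bom == codecs.BOM_UTF8 + without_bom)  # True
```

In this sketch the em-dash bytes (e2 80 94) are identical in both outputs, so if my files behave the same way, the BOM would only change how a program guesses the encoding, not the data itself.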
I have several questions:
1. How can I be certain that the XML file is UTF-8? Is there a Windows tool that can tell me that?
2. How can I be certain that the transformed file is actually bad? Could it be that the real problem is the editor I'm using to look at it? Both EmEditor and UltraEdit show the same thing.
3. I've tried using the XVI32 hex editor to look at the file. I want to know what is actually written to disk, rather than what some GUI program is displaying to me. But, even on a file that looks good in EmEditor, XVI32 shows me the bad characters. Could it be that XVI32 just doesn't understand non-ASCII characters? What Windows hex editor would you recommend for this purpose?
The XML file is 650 MB, and the final text file is 380 MB - so that limits the list of useful tools somewhat.
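In case a script is a reasonable alternative to a GUI tool, here is a rough sketch of the kind of check I have in mind, again in Python as an assumption. The function names check_utf8 and first_non_ascii are hypothetical helpers of my own. It reads the file in chunks so the size should not matter, reports whether a BOM is present and whether the bytes decode as UTF-8, and prints the raw bytes around the first non-ASCII byte.

```python
import codecs

CHUNK = 1 << 20  # read 1 MiB at a time so a large file is never held in memory


def check_utf8(path):
    """Return (has_bom, is_valid_utf8) for the file at path."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    with open(path, "rb") as f:
        head = f.read(3)
        has_bom = head == codecs.BOM_UTF8
        try:
            if not has_bom:
                decoder.decode(head)  # the first bytes are data, not a BOM
            while True:
                chunk = f.read(CHUNK)
                if not chunk:
                    break
                decoder.decode(chunk)
            decoder.decode(b"", final=True)  # fail on a truncated sequence
        except UnicodeDecodeError:
            return has_bom, False
    return has_bom, True


def first_non_ascii(path, context=8):
    """Return (offset, hex of nearby bytes) for the first byte >= 0x80, or None."""
    offset = 0
    with open(path, "rb") as f:
        while True:
            chunk = f.read(CHUNK)
            if not chunk:
                return None
            if chunk.isascii():  # fast path: skip chunks that are pure ASCII
                offset += len(chunk)
                continue
            for i, b in enumerate(chunk):
                if b >= 0x80:
                    pos = offset + i
                    start = max(0, pos - context)
                    f.seek(start)
                    nearby = f.read(pos - start + context)
                    return pos, nearby.hex(" ")
```

A file can pass the UTF-8 check and still be wrong: the three characters "—" written out as UTF-8 are themselves valid UTF-8 (c3 a2 e2 82 ac e2 80 9d), so the hex context matters more than the yes/no answer. A correctly written em-dash should appear as e2 80 94.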