
I'm just beginning to learn about encoding issues, and I've learned just enough to know that it's far more complicated (on Windows, at least) than I had imagined, and that I have a lot more to learn.

I have an xml document that I believe is encoded as UTF-8. I'm using a VB.net app (with XslCompiledTransform and XmlTextWriter) to transform the xml into a column-specific text file. Some characters in the xml are being corrupted in the output text file. Example: an em-dash (—) is being turned into three characters "—". When that happens, the columns in the file are thrown off.

As I understand it, an em-dash is not even a "Unicode character". I wouldn't expect to have issues with it. But, I can make the problem come and go by changing the encoding specified in the VB.net app.

If I use this, the em-dash is preserved:

Using writer = New XmlTextWriter(strOutputPath, New UTF8Encoding(True))

If I use this, the em-dash is corrupted into "—":

Using writer = New XmlTextWriter(strOutputPath, New UTF8Encoding(False))

The True/False simply tells VB whether to write the BOM at the beginning of the file. As I understand it, the BOM is neither necessary nor recommended for UTF-8. So, my preference is False - but then I get the weird characters.
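
For context, the writing side of the app looks roughly like this (a simplified sketch; the paths and stylesheet name are placeholders):

    Imports System.Text
    Imports System.Xml
    Imports System.Xml.Xsl

    Module TransformSketch
        Sub Main()
            ' Placeholder paths - the real app reads a 650 MB xml file.
            Dim strXmlPath As String = "C:\data\input.xml"
            Dim strXsltPath As String = "C:\data\transform.xslt"
            Dim strOutputPath As String = "C:\data\output.txt"

            Dim xslt As New XslCompiledTransform()
            xslt.Load(strXsltPath)

            ' True writes the UTF-8 BOM at the start of the file; False omits it.
            Using writer = New XmlTextWriter(strOutputPath, New UTF8Encoding(False))
                xslt.Transform(strXmlPath, writer)
            End Using
        End Sub
    End Module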

I have several questions:

  1. How can I be certain that the xml file is UTF-8? Is there a Windows tool that can tell me that?

  2. How can I be certain that the transformed file is actually bad? Could it be that the real problem is the editor I'm using to look at it? Both EmEditor and UltraEdit show the same thing.

  3. I've tried using the XVI32 hex editor to look at the file. I want to know what is actually written to disk, rather than what some GUI program is displaying to me. But, even on a file that looks good in EmEditor, XVI32 shows me the bad characters. Could it be that XVI32 just doesn't understand non-ASCII characters? What Windows hex editor would you recommend for this purpose?

The XML file is 650 MB, and the final text file is 380 MB - so that limits the list of useful tools somewhat.

Bruce Bacher
  • Which encoding do you want for the text file XslCompiledTransform creates as the transformation result? How do you look at the text file, which application do you use? It sounds like it recognizes the UTF-8 BOM and reads the file decoding it as UTF-8, but when you don't have the BOM it uses a different, probably 8-bit encoding like Windows-1252. So you either have to make sure you output the BOM, or you have to change the code or application reading the text file to default to UTF-8. – Martin Honnen Dec 11 '13 at 21:01
  • @MartinHonnen I would like the text file encoded as UTF-8, if possible. I am using both EmEditor and UltraEdit to look at the file. So, you think the file is actually UTF-8? I will investigate whether EmEditor or UltraEdit have a way to force the encoding when a file is opened. Thanks. – Bruce Bacher Dec 11 '13 at 21:47
  • I think you get a UTF-8 encoded text file, yes, but then your editor uses an 8-bit encoding like a Windows code page or ISO-8859-x to decode the file. If there is no BOM, an editor can't know a file is UTF-8 encoded, and you have to tell the editor you want to read it as UTF-8. – Martin Honnen Dec 11 '13 at 21:52

2 Answers


You say 'As I understand it, an em-dash is not even a "Unicode character".' What do you mean by that? The Unicode character set definitely contains a code point for the em dash: U+2014. In the UTF-8 encoding, it is three bytes: E2 80 94.
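
You can confirm those bytes for yourself with a couple of lines of VB.net (a quick standalone sketch, independent of your transform):

    Imports System.Text

    Module EmDashBytes
        Sub Main()
            Dim emDash As String = ChrW(&H2014)  ' U+2014 EM DASH
            Dim utf8Bytes As Byte() = Encoding.UTF8.GetBytes(emDash)
            Console.WriteLine(BitConverter.ToString(utf8Bytes))  ' Prints: E2-80-94
        End Sub
    End Module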

I suspect Martin Honnen is right that your editor is simply not showing the file properly. A couple of suggestions:

I'm not familiar with the editors you mention, but editors that handle different encodings will often silently choose an encoding by which to interpret the file (based on the BOM if there is one, and sometimes based on the character codes they see). They also typically have some way of showing what encoding they are interpreting the file as, and a way of telling them to load (or reload) the file as a particular encoding. If your editor does not have these features, I suggest you get one that does, such as EditPlus or Notepad++.
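
That silent guess is exactly what can produce the three characters you are seeing. Here is a quick sketch (using the .NET encodings rather than any particular editor) that reproduces the corruption:

    Imports System.Text

    Module MojibakeDemo
        Sub Main()
            ' The three UTF-8 bytes of an em dash (U+2014).
            Dim bytes As Byte() = New Byte() {&HE2, &H80, &H94}

            ' Decoding them as Windows-1252 instead of UTF-8 yields
            ' the three characters from the question.
            Console.WriteLine(Encoding.GetEncoding(1252).GetString(bytes))  ' Prints: —
        End Sub
    End Module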

As far as the hex editor goes, again I'm not familiar with the one you mention, but the whole point of a hex editor is to see the raw bytes. Such editors often also offer a text view (often side by side with the hex view), and if they do, I would not rely on their handling of encoding. Just use them to view the hex bytes and see if the bytes for your em dash are the same in both files.
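
If you would rather not trust any editor at all, a short sketch like this (the path is a placeholder) dumps the start of the file as hex, so you can look for E2 80 94 directly:

    Imports System
    Imports System.IO

    Module HexDump
        Sub Main()
            ' Read only the first 256 bytes - important for a 380 MB file.
            Using fs As New FileStream("C:\data\output.txt", FileMode.Open, FileAccess.Read)
                Dim buffer(255) As Byte
                Dim read As Integer = fs.Read(buffer, 0, buffer.Length)
                ' Print 16 bytes per line, space-separated.
                For i As Integer = 0 To read - 1 Step 16
                    Dim count As Integer = Math.Min(16, read - i)
                    Dim line(count - 1) As Byte
                    Array.Copy(buffer, i, line, 0, count)
                    Console.WriteLine(BitConverter.ToString(line).Replace("-", " "))
                Next
            End Using
        End Sub
    End Module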

Another way viewing the file can go wrong: even if your editor is interpreting the file as UTF-8, not all fonts will have all Unicode characters in them, and for characters not in the font they may display a little square or nothing at all. Try a few different fonts, or find one that purports to support Unicode (though no font supports ALL of Unicode, and there are several revisions of the Unicode spec which add more characters). Lucida Sans Unicode, I think, is one that will be on most Windows systems.

Another trick: I highly recommend the utility BabelMap. You can look up any Unicode character there and see its Unicode value, and you can copy the character from there, paste it into the file in your text editor, and see how it displays it.

EricS
  • What I meant about em-dash not being a Unicode character is that you don't need to use Unicode to use an em-dash. Isn't it in the extended ASCII range, or ANSI, or... I don't know; as I said, I'm just scratching the surface of this stuff. I can't use Notepad++ because the file is too large - it gives me an error message saying that. I believe both EmEditor and UltraEdit have the capability to force a file to be opened using a particular encoding. I'll research that angle. What about the BOM? I have read that it is not necessary and not recommended for UTF-8. – Bruce Bacher Dec 12 '13 at 14:40
  • Using the Hex view in UltraEdit shows E2 80 94 where the em-dash is. If I understand what you wrote correctly, that confirms that the file is truly written as UTF-8, but I am viewing it badly. The GUI view of the file displays "—". – Bruce Bacher Dec 12 '13 at 15:24
  • The em-dash definitely isn't in ASCII, but it is in Windows-1252, which some people loosely call ANSI and which sometimes gets confused with ASCII. Only 00-7F are properly ASCII and compatible with UTF-8. – EricS Dec 13 '13 at 02:19
  • As far as the BOM, as I understand it, it's optional, so you don't need it. But it's definitely handy for editors that understand it. It might be distinctly unhandy if you try to open or process the file with something that doesn't understand it, and I think advice not to use it may be based on that possibility. I think that was more of a concern in the past, though. – EricS Dec 13 '13 at 02:23
  • Even though Notepad++ comes up with the "This file is too large" error, it still shows what it believes the encoding is: "ANSI as UTF-8". – Bruce Bacher Dec 13 '13 at 19:08

UltraEdit offers several configuration settings for working with UTF-8 encoded files. There is an Auto detect UTF-8 files setting in the File Handling - Unicode/UTF-8 Detection configuration dialog, which is enabled by default.

With this setting enabled, UltraEdit first searches for the UTF-8 BOM. If it is not present, it searches the first few KB for a UTF-8 declaration, as is usually present in the head of HTML/XHTML files or in the first line of an XML file. If there is no BOM and no standardized encoding information at the top of the file, UltraEdit searches the first 64 KB for byte sequences that look like UTF-8 encoded characters. If such a byte sequence is found, the file is interpreted by UltraEdit as a UTF-8 encoded file. For example, a file containing only the 3 bytes E2 80 94 is interpreted as a UTF-8 encoded file.
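
A similar check can be scripted outside the editor. The following VB.net sketch (my own rough approximation, not UltraEdit's actual algorithm) tests for the UTF-8 BOM and then attempts a strict UTF-8 decode of the first 64 KB:

    Imports System.IO
    Imports System.Text

    Module Utf8Check
        Sub Main()
            Dim path As String = "C:\data\output.txt"  ' placeholder
            Dim buffer(65535) As Byte                  ' first 64 KB
            Dim read As Integer
            Using fs As New FileStream(path, FileMode.Open, FileAccess.Read)
                read = fs.Read(buffer, 0, buffer.Length)
            End Using

            If read >= 3 AndAlso buffer(0) = &HEF AndAlso buffer(1) = &HBB AndAlso buffer(2) = &HBF Then
                Console.WriteLine("UTF-8 BOM found.")
            Else
                Try
                    ' The second constructor argument makes the decoder throw
                    ' on byte sequences that are not valid UTF-8.
                    Dim strict As New UTF8Encoding(False, True)
                    Dim ignored As String = strict.GetString(buffer, 0, read)
                    Console.WriteLine("No BOM, but the first 64 KB decode as valid UTF-8.")
                Catch ex As DecoderFallbackException
                    ' Note: a multi-byte character split at the 64 KB boundary
                    ' would also land here - good enough for a quick check.
                    Console.WriteLine("Not valid UTF-8 - probably an 8-bit code page.")
                End Try
            End If
        End Sub
    End Module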

UltraEdit indicates in the status bar at the bottom of the main window which encoding was detected and is active (on save) for the active file. The status bar shows either UTF-8 or U8-, depending on which status bar is used (advanced or basic) and which version of UltraEdit is used, as the older versions have only the basic status bar.

Only files encoded in UTF-8 with no BOM, no UTF-8 character set or encoding declaration, and no UTF-8 encoded character within the first 64 KB are wrongly opened as ANSI files. In such cases the user can use the enhanced File - Open command of UltraEdit and explicitly select UTF-8 encoding before opening the file with the Open button.

For completeness, there is also a configuration setting which can be manually added to uedit32.ini that results in opening all files not detected as UTF-16 files as UTF-8 encoded files. This setting is useful for those who want to work only with UTF-8 encoded files, even when a file very often contains no characters with a code value greater than 127.

For more information about working with UTF-8 encoded files, take a look at the UltraEdit forums. There are a few topics with lots of information about editing UTF-8 encoded files in UltraEdit.

Mofi