1

I have an XML document, file.xml, which is encoded in ISO-8859-15 (also known as Latin-9):

<?xml version="1.0" encoding="iso-8859-15"?>
<root xmlns="http://stackoverflow.com/demo">
  <f>€.txt</f>
</root>

According to my favorite text editor, this file is correctly encoded in ISO-8859-15 (it is not UTF-8).

My software is written in C# and wants to extract the element f.

XmlDocument xmlDoc = new XmlDocument();
xmlDoc.Load("file.xml"); 

In real life, I also set an XmlResolver to supply credentials, but basically my code is as simple as that. The file loads smoothly and no exception is raised.

The problem appears when I extract the value:

// xnsm is the XmlNamespaceManager
XmlNode n = xmlDoc.SelectSingleNode("//root/f", xnsm);
// A declaration cannot be the sole body of an if, so declare first
string filename = null;
if (n != null)
  filename = n.InnerText;

The Visual Studio debugger displays filename = □.txt

It could only be a Visual Studio bug. Unfortunately, File.Exists(filename) returns false, even though the file actually exists.

What's wrong?

rds
  • I have double-checked the encoding with Visual Studio. – rds Dec 09 '10 at 14:15
  • Have you checked whether the error also occurs when you load from a Stream for which you set the encoding manually? I would be careful with statements like "It could only be a Visual Studio bug"... – Hinek Dec 09 '10 at 14:24

3 Answers

5

If I remember correctly, the XmlDocument.Load(string) method always assumes UTF-8, regardless of the XML encoding.

You would have to create a StreamReader with the correct encoding and use that as the parameter.

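// File.Open(string) has no single-argument overload; let StreamReader open the file
using (var reader = new StreamReader("file.xml", Encoding.GetEncoding("iso-8859-15")))
{
    xmlDoc.Load(reader);
}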

EDIT:

I just stumbled across KB308061 from Microsoft. There's an interesting passage:

Specify the encoding declaration in the XML declaration section of the XML document. For example, the following declaration indicates that the document is in UTF-16 Unicode encoding format:

<?xml version="1.0" encoding="UTF-16"?>

Note that this declaration only specifies the encoding format of an XML document and does not modify or control the actual encoding format of the data.
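To illustrate the point of that passage, here is a minimal sketch showing that the same byte yields different characters depending on which encoding is used to decode it. The class name EncodingDemo and the helper DecodeByte are made up for this example, and it assumes the .NET Framework, where both encodings are available out of the box (on .NET Core and later, Encoding.RegisterProvider(CodePagesEncodingProvider.Instance) would have to be called first).

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Hypothetical helper: decodes a single byte with the named encoding
    // and returns the resulting UTF-16 code unit as four hex digits.
    public static string DecodeByte(byte b, string encodingName)
    {
        // Assumption: the named encoding is installed (true on .NET Framework).
        string s = Encoding.GetEncoding(encodingName).GetString(new[] { b });
        return ((int)s[0]).ToString("x4");
    }

    static void Main()
    {
        // Compare two candidate euro bytes under both encodings.
        foreach (byte b in new byte[] { 0x80, 0xA4 })
        {
            foreach (string name in new[] { "iso-8859-15", "windows-1252" })
            {
                Console.WriteLine("byte 0x{0:x2} as {1}: U+{2}",
                                  b, name, DecodeByte(b, name));
            }
        }
    }
}
```

If a file declares iso-8859-15 but was actually saved as windows-1252, the parser trusts the declaration and decodes the bytes accordingly, which is the kind of mismatch this sketch makes visible.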

VVS
  • Thanks for the pointer. However, I cannot assume the input file is Iso-8859-15. – rds Dec 09 '10 at 14:31
  • I understand the `Load()` method pays attention to the xml header, as I thought. Their implementation of `XmlDocument` would have sucked, otherwise. – rds Dec 09 '10 at 14:50
3

Don't just use the debugger or the console to display the string as a string.

Instead, dump the contents of the string, one character at a time. For example:

foreach (char c in filename)
{
    Console.WriteLine("{0}: {1:x4}", c, (int) c);
}

That will show you the real contents of the string, in terms of Unicode code points, instead of being constrained by what the current font can display.

Use the Unicode code charts to look up the characters specified.
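The same idea can be applied one level lower, to the file itself. The following is only a sketch: it dumps the raw bytes of file.xml as hex so they can be compared with the decoded characters. The class name ByteDump and the helper ToHex are invented for this example, and the file name is the one from the question.

```csharp
using System;
using System.IO;

static class ByteDump
{
    // Hypothetical helper: formats each byte as two uppercase hex digits,
    // separated by spaces.
    public static string ToHex(byte[] bytes)
    {
        return BitConverter.ToString(bytes).Replace("-", " ");
    }

    static void Main()
    {
        // Read the file without any decoding, so no encoding is assumed.
        byte[] raw = File.ReadAllBytes("file.xml");
        Console.WriteLine(ToHex(raw));
    }
}
```

Locating the byte that sits just before ".txt" in the dump shows what the file really contains, independent of what the XML declaration claims.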

Jon Skeet
  • Not really the answer, but that's definitely a good way to debug this situation. Thanks. Now I know the problematic character is `0080`. This is a control character in Unicode. Interestingly enough, that's the euro symbol in Windows-1252. I think string should be implemented in Unicode internally; this makes me think more and more that there is a bug in the XmlDocument implementation. – rds Dec 09 '10 at 14:52
  • @rds: Okay, so you now know it definitely hasn't decoded it properly. Next stop: what's in the file, in terms of bytes? And does .NET understand iso-8859-15 in general? – Jon Skeet Dec 09 '10 at 14:54
  • 1
    +1 I suspect the file is actually encoded in `windows-1252` and not `ISO-8859-15` at all. Does the euro character display when viewed in an XML viewer (eg a web browser)? Windows and .NET do support ISO-8859-15, but it's very rarely used. – bobince Dec 09 '10 at 15:02
  • Conclusion: yes, the input file contains the byte 0x80. – rds Dec 09 '10 at 15:17
  • @rds: Aha. Okay, that should be 0xA4 according to http://en.wikipedia.org/wiki/ISO/IEC_8859-15 – Jon Skeet Dec 09 '10 at 15:21
0
  1. Does your XML declare its encoding correctly? It says encoding="iso-8859-15". Is that the right name for the encoding you mean?

  2. Ideally, you should put your content inside a CDATA section, so the XML would look like <f><![CDATA[€.txt]]></f>

  3. Ideally, you should also escape all special characters with their URL-encoded (percent-encoded) equivalents, because XML is typically sent over HTTP.

I don't know the exact escape code for €, but it would be something of this sort:

<f><![CDATA[%3E.txt]]></f>

The above should make € be communicated correctly through the xml.

  • Ideally, you should put your code in code blocks, so that they are correctly displayed afterwards – Lucero Dec 09 '10 at 14:23
  • 1
    CDATA sections do nothing to help encoding issues. In fact, since they contain only raw character data, they prevent you from using character references like `&#8364;`, which is what you seem to be going for in (3). – bobince Dec 09 '10 at 14:52
  • Yes, (2) does not specifically fix the problem; the intention is to protect other special characters that may appear in the value. In (3) I purposely used a URL-encoded example, %3E (not a €), which should be decoded in code after extracting the value from the XML. –  Dec 09 '10 at 15:19