XML Parsing Problem in German Culture- ASP.NET

Question

Coding Platform: ASP.NET WebForms 4.0 with C#

Background: I am reading some values from XML and everything was working in my locale (en-US). The XML looks like this

<?xml version="1.0" encoding="utf-32" ?>
<settings>
  <UserRegistration>AutoAuthorize</UserRegistration>
  <OpenIDProfile>PromptUser</OpenIDProfile>
  <EnableSpamProtection>Yes</EnableSpamProtection>
  <MaxAllowedOpenID>2</MaxAllowedOpenID>
  <WebsiteURL>http://localhost:70707/blah/</WebsiteURL>
  <FacebookOAuthURL>https://graph.facebook.com/oauth/authorize?</FacebookOAuthURL>
  <FacebookAccessTokenURL>https://graph.facebook.com/oauth/access_token?</FacebookAccessTokenURL>
  <FacebookRedirectPage>ausgefüllt.aspx</FacebookRedirectPage>
  <FacebookAppID>192328104139846</FacebookAppID>
  <FacebookAppKey>29daeb58d8ae84cc22181f4073e4ed9d</FacebookAppKey>
  <FacebookAppSecret>b94e9ddd20efc47b3227e7333925fdd8</FacebookAppSecret>
  <FacebookScope>email</FacebookScope>
  <EmailSettingsDisplayName>admin</EmailSettingsDisplayName>
  <EmailSettingsEmail>blah@blah.com</EmailSettingsEmail>
  <EmailSettingsPassword>192185135098207157230060249027191124199097098215</EmailSettingsPassword>
</settings>

Problem

I packaged the whole thing and sent it to my client for testing. The testing environment is:

Server: Windows Server 2008 R2 64 bit
Locale: German (de-DE)

Now, when I try to read the XML, Elmah logs two errors. The first error is

System.Xml.XmlException: '', hexadecimal value 0xA000D, is an invalid character. Line 1, position 40.
   at System.Xml.XmlTextReaderImpl.Throw(String res, String[] args)
   at System.Xml.XmlTextReaderImpl.ParseRootLevelWhitespace()
   at System.Xml.XmlTextReaderImpl.ParseDocumentContent()
   at System.Xml.Linq.XDocument.Load(XmlReader reader, LoadOptions options)
   at System.Xml.Linq.XDocument.Load(String uri, LoadOptions options)
   at Administrator_SiteSettings.SaveSettingsButton_Click(Object sender, EventArgs e) in c:\Webs\ThirdPartyLogins\Administrator\SiteSettings.aspx.cs:line 48

I am loading these XML node values into a Dictionary, and this error is followed by a key-not-found error for the dictionary.
Is encoding the culprit?
What could be wrong in my code?

Update: I just read "UTF-8, UTF-16, and UTF-32". Will changing to utf-8 help?

Update2: Two things that might clarify the issue more.

1) On changing the encoding to utf-16, I got a new error:

System.Xml.XmlException: '.', hexadecimal value 0x00, is an invalid character. Line 1, position 39.

2) The XML pasted earlier was not complete. It had some more nodes with URLs as node data. Will that be an issue? I have updated the XML above as well.

It seems like an encoding problem. Are you sure the XML file actually is in UTF-32? — svick, Apr 02 '11 at 15:30

Aaron · Answer 1 · 2011-05-13T13:58:18.737

Short answer: Yes, the encoding is the culprit; the correct encoding is utf-16.

Long answer: The clue lies in the exception text, where it says "hexadecimal value 0xA000D" and "Line 1, position 40".

When XmlReader reads your file, it first reads the XML declaration (everything between <?xml and ?>) to determine which encoding to use for the rest of the file. In this case the declaration says UTF-32. So immediately after reading the > character at the end of the declaration, it switches to using UTF-32 encoding. As your linked article explains, UTF-32 uses 4 bytes to represent each character, so the XmlReader reads the next 4 bytes from the file and tries to interpret them as a character. (This lines up with your error message, since line 1 position 40 is immediately after the > character.)

If the file really were UTF-32, what would the next 4 bytes be? Well, the next thing in the file after the > character is a newline, which is made up of two characters, carriage return and linefeed (in Unicode, 0D and 0A respectively). So we would expect the next 4 bytes to be 0D 00 00 00, and the next 4 after that would be 0A 00 00 00 (remember, Windows is little-endian).

But as the error message states, the actual "character" read was 0xA000D, which means the next 4 bytes were 0D 00 0A 00 (again, remember little-endian). That's pretty close, but apparently only 2 bytes are being used for each character instead of 4. Well, we have a name for that, don't we? It's called UTF-16!
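
The byte arithmetic above can be checked with a short C# sketch. It is only an illustration of the reasoning, not code from the question: the class name EncodingDemo and the helper ReadUtf32LittleEndian are made up for this example, and it assumes the file stores the line break as CR LF in little-endian byte order.

```csharp
using System;
using System.Text;

static class EncodingDemo
{
    // Hypothetical helper: reads four bytes as one little-endian 32-bit code unit,
    // which is how a UTF-32 (little-endian) decoder would group them.
    public static uint ReadUtf32LittleEndian(byte[] bytes, int offset)
    {
        return (uint)(bytes[offset]
            | (bytes[offset + 1] << 8)
            | (bytes[offset + 2] << 16)
            | (bytes[offset + 3] << 24));
    }

    static void Main()
    {
        // CR LF as stored in a UTF-16 little-endian file (Encoding.Unicode).
        byte[] crLfUtf16 = Encoding.Unicode.GetBytes("\r\n");
        Console.WriteLine(BitConverter.ToString(crLfUtf16)); // 0D-00-0A-00

        // The same four bytes read as a single UTF-32 code unit.
        uint codeUnit = ReadUtf32LittleEndian(crLfUtf16, 0);
        Console.WriteLine("0x{0:X}", codeUnit); // 0xA000D

        // CR LF as it would be stored in a genuine UTF-32 little-endian file.
        byte[] crLfUtf32 = Encoding.UTF32.GetBytes("\r\n");
        Console.WriteLine(BitConverter.ToString(crLfUtf32)); // 0D-00-00-00-0A-00-00-00
    }
}
```

The value 0xA000D from the exception falls out of the UTF-16 bytes directly, while a real UTF-32 file would have produced 0xD followed by 0xA as two separate characters.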

+1 Excellent explanation, thanks. I figured that since I am using German (with umlauts), I would be better off using utf-8. Your thoughts? — naveen, Apr 02 '11 at 16:57
With utf-16, it is System.Xml.XmlException: '.', hexadecimal value 0x00, is an invalid character. Line 1, position 39. — naveen, Apr 02 '11 at 17:50
Hmm, well at this point it would help to be able to see the exact bytes that make up your XML file. Is there some way you could post that here (e.g. base64 encode it and append to your question, or zip the file, upload the zip somewhere, and link to it)? — Aaron, Apr 04 '11 at 18:02
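
One way to look at the exact bytes, as the last comment asks, is to dump the start of the file as hex and check it for a byte order mark (BOM). The following is a minimal sketch under stated assumptions: the class BomInspector, the method DetectBom, and the default path settings.xml are hypothetical names for illustration, and a file saved without a BOM will simply report that none was found.

```csharp
using System;
using System.IO;

static class BomInspector
{
    // Guesses the encoding from a byte order mark at the start of the data.
    // Returns null when no known BOM is present.
    public static string DetectBom(byte[] head)
    {
        // UTF-32 LE is tested before UTF-16 LE because its BOM also starts with FF FE.
        if (StartsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "utf-32 (little-endian)";
        if (StartsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "utf-32 (big-endian)";
        if (StartsWith(head, 0xEF, 0xBB, 0xBF)) return "utf-8";
        if (StartsWith(head, 0xFF, 0xFE)) return "utf-16 (little-endian)";
        if (StartsWith(head, 0xFE, 0xFF)) return "utf-16 (big-endian)";
        return null;
    }

    static bool StartsWith(byte[] data, params byte[] prefix)
    {
        if (data.Length < prefix.Length) return false;
        for (int i = 0; i < prefix.Length; i++)
        {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    static void Main(string[] args)
    {
        // Hypothetical path; pass the real settings file as the first argument.
        string path = args.Length > 0 ? args[0] : "settings.xml";
        byte[] all = File.ReadAllBytes(path);
        byte[] head = new byte[Math.Min(16, all.Length)];
        Array.Copy(all, head, head.Length);

        // Print the first bytes as hex, then the encoding the BOM suggests.
        Console.WriteLine(BitConverter.ToString(head));
        Console.WriteLine(DetectBom(head) ?? "no BOM found");
    }
}
```

If the hex dump shows a zero byte after every ASCII character (for example 3C-00-3F-00 for "<?"), that would be consistent with the UTF-16 explanation in the answer.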
