0

I'm importing data from XML files containing this type of content:

<FirstName>™MšR</FirstName><MiddleName/><LastName>HšNER™Z</LastName>

The XML is loaded via:

 XmlDocument doc = new XmlDocument();

 try
 {
     doc.Load(fullFilePath);
 }
 catch (XmlException)
 {
     // The illegal-character exception is raised by doc.Load above.
     throw;
 }

When I execute this code with the data shown above, I get an exception about an illegal character. I understand that part just fine.

I'm not sure which encoding this is or how to solve this problem. Is there a way I can change the encoding of the XmlDocument or another method to make sure the above content is parsed correctly?


Update: The document has no XML declaration (<?xml ... ?>), so no encoding is declared.

I've seen some links suggest adding one dynamically. Is this UTF-16 encoding?

John Farrell

4 Answers

3

It appears that:

  • The name was ÖMÜR HÜNERÖZ (or possibly ÔMÜR HÜNERÔZ or ÕMÜR HÜNERÕZ; I don't know what language that is).
  • The XML file was encoded using the DOS "OEM" code page, probably 437 or 850.
  • But it was decoded using windows-1252 (the "ANSI" code page).
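
The diagnosis in the list above can be reproduced in a few lines. This is only a sketch: the class and method names are hypothetical, and it assumes a modern .NET runtime (.NET Core 3.0 or later), where legacy code pages must be registered through CodePagesEncodingProvider before Encoding.GetEncoding(437) works. It writes the name in code page 437, reads the bytes back as windows-1252, and then reverses the mistake.

```csharp
using System;
using System.Text;

public static class MojibakeDemo
{
    // Encodes text with one code page, then decodes the same bytes with another.
    public static string Reinterpret(string text, Encoding writtenAs, Encoding readAs)
    {
        return readAs.GetString(writtenAs.GetBytes(text));
    }

    public static void Main()
    {
        // Assumption: running on .NET Core / .NET 5+, where code pages such as
        // 437 and 1252 are only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        Encoding oem = Encoding.GetEncoding(437);   // DOS "OEM" code page
        Encoding ansi = Encoding.GetEncoding(1252); // Windows "ANSI" code page

        // "\u00D6M\u00DCR" is the name with O-umlaut and U-umlaut, written with
        // escapes so the example does not depend on the source file's encoding.
        string garbled = Reinterpret("\u00D6M\u00DCR", oem, ansi);
        Console.WriteLine(garbled);

        // Reversing the two encodings undoes the damage.
        string repaired = Reinterpret(garbled, ansi, oem);
        Console.WriteLine(repaired);
    }
}
```

The repair step only round-trips cleanly when every garbled character exists in windows-1252; a few byte values are unassigned there, so this is a diagnostic aid rather than a general fix.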
dan04
  • Good catch! I'd only looked at the 125x code pages -- totally forgot about the DOS ones... I'll add some more info to my reply. – Arnout Dec 16 '10 at 07:36
2

If you look at the file with a hex editor (HXD or Visual Studio, for instance), what exactly do you see?

Is every character from the string you posted represented by a single byte? Does the file have a byte-order mark (a bunch of non-printable bytes at the start of the file)?

The ™ and š seem to indicate that something went pretty wrong with encoding/conversion along the way, but let's see... I guess they both correspond with a vowel (O-M-A-R H-A-NER-O-Z, maybe?), but I haven't figured out yet how they ended up looking like this...

Edit: dan04 hit the nail on the head. In cp-1252, ™ has hex value 99 and š is 9A. In cp-437 and cp-850, hex 99 represents Ö and 9A represents Ü.

The fix is simple: just specify this encoding when opening your XML file:

XmlDocument doc = new XmlDocument();

using (var reader = new StreamReader(fileName, Encoding.GetEncoding(437)))
{
   doc.Load(reader);
}
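
For readers on a newer runtime, here is a hedged, self-contained variant of the same fix. It assumes .NET Core 3.0 or later, where code page 437 is unavailable until CodePagesEncodingProvider is registered; the class name, the Person root element, and the temporary file are made up for the demonstration and are not from the question.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class OemXmlLoader
{
    // Loads an XML file whose bytes are in the given legacy code page.
    public static XmlDocument Load(string path, int codePage)
    {
        // Assumption: .NET Core / .NET 5+. Registering more than once is harmless.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        var doc = new XmlDocument();
        using (var reader = new StreamReader(path, Encoding.GetEncoding(codePage)))
        {
            // The reader hands XmlDocument already-decoded text.
            doc.Load(reader);
        }
        return doc;
    }

    public static void Main()
    {
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Write a small sample file in code page 437 (no BOM, no XML declaration).
        string path = Path.GetTempFileName();
        File.WriteAllText(
            path,
            "<Person><FirstName>\u00D6M\u00DCR</FirstName></Person>",
            Encoding.GetEncoding(437));

        XmlDocument doc = Load(path, 437);
        Console.WriteLine(doc.DocumentElement["FirstName"].InnerText);

        File.Delete(path);
    }
}
```

Passing 850 instead of 437 would decode these particular characters the same way, since both code pages map 99 to Ö and 9A to Ü; other characters differ between the two, so the right choice depends on how the files were produced.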
Arnout
1

From here:

Encoding encoding;
using (var stream = new MemoryStream(bytes))
{
    using (var xmlreader = new XmlTextReader(stream))
    {
        xmlreader.MoveToContent();
        encoding = xmlreader.Encoding;
    }
}

You might want to take a look at this: How to best detect encoding in XML file?

For the actual reading, you can use a StreamReader, which takes care of the BOM (byte order mark):

string xml;

// The second argument (true) is detectEncodingFromByteOrderMarks.
using (var reader = new StreamReader("FilePath", true))
{
    xml = reader.ReadToEnd();
}

Edit: Removed the encoding parameter. StreamReader will detect the encoding of a file if the file contains a BOM. If it does not, it will default to UTF-8.
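
A small sketch of what that UTF-8 fallback means for a file like the one in the question. The class and method names are hypothetical, and the four bytes are the code page 437 form of the first name. With no BOM present, the reader falls back to UTF-8, where bytes 99 and 9A are invalid on their own, so they should come back as the replacement character U+FFFD rather than as letters.

```csharp
using System;
using System.IO;
using System.Text;

public static class BomFallbackDemo
{
    // Reads a file with BOM detection enabled and reports the encoding used.
    public static string Read(string path, out Encoding used)
    {
        using (var reader = new StreamReader(path, true))
        {
            string text = reader.ReadToEnd();
            // CurrentEncoding is only meaningful after the first read.
            used = reader.CurrentEncoding;
            return text;
        }
    }

    public static void Main()
    {
        // Four bytes with no BOM: 99 'M' 9A 'R'.
        string path = Path.GetTempFileName();
        File.WriteAllBytes(path, new byte[] { 0x99, 0x4D, 0x9A, 0x52 });

        Encoding used;
        string text = Read(path, out used);

        Console.WriteLine(used.WebName);
        Console.WriteLine(text.IndexOf('\uFFFD') >= 0
            ? "undecodable bytes were replaced"
            : "all bytes decoded");

        File.Delete(path);
    }
}
```

BOM detection therefore does not help when a legacy single-byte code page is involved; the encoding has to be supplied explicitly.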

Edit 2: Detecting Text Encoding for StreamReader

A G
0

You evidently posted only a fragment of the XML document, since it has no root element; I'll assume that was intentional. Is there an XML declaration at the top, such as <?xml version="1.0" encoding="UTF-8" ?>?

robert
Tom P.