Correcting Encoding in a Large XML File

Question

I'm importing data from XML files containing this type of content:

<FirstName>™MšR</FirstName><MiddleName/><LastName>HšNER™Z</LastName>

The XML is loaded via:

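 XmlDocument doc = new XmlDocument();

 try
 {
       doc.Load(fullFilePath);
 }
 catch (XmlException ex)
 {
       // The load fails here with the "illegal character" error.
       Console.Error.WriteLine(ex.Message);
 }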

When I run this code against the data shown above, I get an exception about an illegal character. I understand that part just fine.

I'm not sure which encoding this is or how to solve this problem. Is there a way I can change the encoding of the XmlDocument or another method to make sure the above content is parsed correctly?

Update: I do not have any encoding declaration or <?xml in this document.

I've seen some links suggesting that I add one dynamically. Is this UTF-16 encoding?

score 3 · Answer 1 · answered Dec 16 '10 at 01:20

It appears that:

The name was ÖMÜR HÜNERÖZ (or possibly ÔMÜR HÜNERÔZ or ÕMÜR HÜNERÕZ; I don't know what language that is).
The XML file was encoded using the DOS "OEM" code page, probably 437 or 850.
But it was decoded using windows-1252 (the "ANSI" code page).
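The diagnosis above can be checked by reversing the mistake: turn the garbled text back into bytes with windows-1252, then decode those bytes with the DOS code page. The following is only a minimal sketch; the class and method names (MojibakeDemo, Repair) are made up for illustration, and the RegisterProvider call is assumed to be needed only on .NET Core / .NET 5+ (on .NET Framework the code pages are built in).

```csharp
using System;
using System.Text;

static class MojibakeDemo
{
    // Re-encodes the garbled text as windows-1252 to recover the original bytes,
    // then decodes those bytes with the assumed DOS (OEM) code page.
    public static string Repair(string garbled, int oemCodePage)
    {
        byte[] raw = Encoding.GetEncoding(1252).GetBytes(garbled);
        return Encoding.GetEncoding(oemCodePage).GetString(raw);
    }

    static void Main()
    {
        // Needed on .NET Core / .NET 5+; on .NET Framework this line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        Console.OutputEncoding = Encoding.UTF8;

        // "\u2122" is the trademark sign and "\u0161" is s-with-caron,
        // i.e. the two odd characters from the question.
        Console.WriteLine(Repair("\u2122M\u0161R", 437));      // ÖMÜR
        Console.WriteLine(Repair("H\u0161NER\u2122Z", 850));   // HÜNERÖZ
    }
}
```

Code pages 437 and 850 agree on bytes 0x99 and 0x9A, so this particular sample cannot tell the two apart; other characters in the file might.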

answered Dec 16 '10 at 01:20

dan04


Good catch! I'd only looked at the 125x code pages -- totally forgot about the DOS ones... I'll add some more info to my reply. – Arnout Dec 16 '10 at 07:36

score 2 · Answer 2 · edited May 23 '17 at 11:55

If you look at the file with a hex editor (HxD or Visual Studio, for instance), what exactly do you see?

Is every character from the string you posted represented by a single byte? Does the file have a byte-order mark (a bunch of non-printable bytes at the start of the file)?
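If no hex editor is at hand, the same two checks can be done with a few lines of code. This is a rough sketch, not a full encoding detector: ByteInspector, ReadHead, DetectBom and HexDump are hypothetical names, and only the common Unicode byte-order marks are recognized. It reads just the first bytes, so it is cheap even on a large file.

```csharp
using System;
using System.IO;

static class ByteInspector
{
    // Reads up to 'count' bytes from the start of the file without loading the rest.
    public static byte[] ReadHead(string path, int count)
    {
        using (var fs = File.OpenRead(path))
        {
            var buffer = new byte[count];
            int read = fs.Read(buffer, 0, count);
            Array.Resize(ref buffer, read);
            return buffer;
        }
    }

    // Names the byte-order mark at the start of the buffer, or "none".
    public static string DetectBom(byte[] b)
    {
        if (b.Length >= 3 && b[0] == 0xEF && b[1] == 0xBB && b[2] == 0xBF) return "UTF-8";
        if (b.Length >= 4 && b[0] == 0xFF && b[1] == 0xFE && b[2] == 0x00 && b[3] == 0x00) return "UTF-32 LE";
        if (b.Length >= 4 && b[0] == 0x00 && b[1] == 0x00 && b[2] == 0xFE && b[3] == 0xFF) return "UTF-32 BE";
        if (b.Length >= 2 && b[0] == 0xFF && b[1] == 0xFE) return "UTF-16 LE";
        if (b.Length >= 2 && b[0] == 0xFE && b[1] == 0xFF) return "UTF-16 BE";
        return "none";
    }

    // Formats the first 'count' bytes as hex pairs separated by dashes.
    public static string HexDump(byte[] b, int count)
    {
        return BitConverter.ToString(b, 0, Math.Min(count, b.Length));
    }

    static void Main(string[] args)
    {
        byte[] head = ReadHead(args[0], 64);   // args[0]: path to the XML file
        Console.WriteLine("BOM:   " + DetectBom(head));
        Console.WriteLine("Bytes: " + HexDump(head, 64));
    }
}
```

If each odd character in the name shows up as a single byte (99 or 9A, say) and no BOM is reported, the file is in some single-byte code page rather than UTF-8 or UTF-16.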

The ™ and š seem to indicate that something went pretty wrong with encoding/conversion along the way, but let's see... I guess they both correspond with a vowel (O-M-A-R H-A-NER-O-Z, maybe?), but I haven't figured out yet how they ended up looking like this...

Edit: dan04 hit the nail on the head. ™ in cp-1252 has hex value 99, and š is 9a. In cp-437 and cp-850, hex 99 represents Ö, and 9a Ü.

The fix is simple: just specify this encoding when opening your XML file:

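XmlDocument doc = new XmlDocument();

// Decode the file as DOS code page 437 instead of letting the parser guess.
// On .NET Core / .NET 5+, first call
// Encoding.RegisterProvider(CodePagesEncodingProvider.Instance).
using (var reader = new StreamReader(fileName, Encoding.GetEncoding(437)))
{
   doc.Load(reader);
}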
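If the file is large, or other tools need to read it too, another option is to convert it to UTF-8 once and load the converted copy. Below is a hedged sketch of that idea; Transcoder and ToUtf8 are made-up names, the 8192-character buffer size is arbitrary, and code page 437 is assumed to be the right source encoding as discussed above.

```csharp
using System;
using System.IO;
using System.Text;

static class Transcoder
{
    // Copies text from 'input' (decoded with sourceEncoding) to 'output' as UTF-8,
    // one buffer at a time so the whole file never has to fit in memory.
    // The caller keeps ownership of both streams.
    public static void ToUtf8(Stream input, Stream output, Encoding sourceEncoding)
    {
        var reader = new StreamReader(input, sourceEncoding);
        var writer = new StreamWriter(output, new UTF8Encoding(false)); // no BOM
        var buffer = new char[8192];
        int read;
        while ((read = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            writer.Write(buffer, 0, read);
        }
        writer.Flush();
    }

    static void Main(string[] args)
    {
        // Needed on .NET Core / .NET 5+; on .NET Framework this line can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // args[0]: source file in code page 437, args[1]: UTF-8 file to create.
        using (var input = File.OpenRead(args[0]))
        using (var output = File.Create(args[1]))
        {
            ToUtf8(input, output, Encoding.GetEncoding(437));
        }
    }
}
```

Since the document has no encoding declaration, the converted file should then load as UTF-8 without any extra hints.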

score 1 · Answer 3 · edited May 23 '17 at 11:55

From here:

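// bytes is assumed to hold the raw contents of the XML file.
Encoding encoding;
using (var stream = new MemoryStream(bytes))
using (var xmlreader = new XmlTextReader(stream))
{
    // Read past the prolog so the reader has seen any BOM or XML declaration.
    xmlreader.MoveToContent();
    // Defaults to UTF-8 when the file has neither.
    encoding = xmlreader.Encoding;
}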

You might want to take a look at this: How to best detect encoding in XML file?

For the actual reading you can use StreamReader, which takes care of the BOM (byte order mark):

string xml;

// The second argument is detectEncodingFromByteOrderMarks.
using (var reader = new StreamReader("FilePath", true))
{
    xml = reader.ReadToEnd();
}

Edit: Removed the encoding parameter. StreamReader will detect the encoding of a file if the file contains a BOM. If it does not, it will default to UTF-8.
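To see what that UTF-8 default means for bytes like the ones in the question, here is a small sketch. BomFallbackDemo and Decode are invented names, and the four bytes are the code page 437 form of ÖMÜR identified in the other answer. Without a BOM the reader treats them as UTF-8, where a lone 0x99 or 0x9A is invalid and gets replaced by U+FFFD.

```csharp
using System;
using System.IO;

static class BomFallbackDemo
{
    // Decodes raw bytes the same way new StreamReader(path, true) would:
    // use the BOM if there is one, otherwise assume UTF-8.
    public static string Decode(byte[] raw)
    {
        using (var stream = new MemoryStream(raw))
        using (var reader = new StreamReader(stream, true))
        {
            return reader.ReadToEnd();
        }
    }

    static void Main()
    {
        byte[] raw = { 0x99, 0x4D, 0x9A, 0x52 };   // no BOM, single-byte text
        foreach (char c in Decode(raw))
        {
            Console.Write("U+{0:X4} ", (int)c);
        }
        Console.WriteLine();                        // U+FFFD U+004D U+FFFD U+0052
    }
}
```

So BOM detection alone will not recover text from a file like this; the encoding still has to be passed in explicitly.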

Edit 2: Detecting Text Encoding for StreamReader

score 0 · Answer 4 · edited Jun 03 '12 at 19:14

Obviously you provided only a fragment of the XML document, since it's missing a root element, so I'll assume that was your intention. Is there an XML declaration at the top, like <?xml version="1.0" encoding="UTF-8" ?>?

edited Jun 03 '12 at 19:14

robert


answered Dec 15 '10 at 19:48

Tom P.
