21

I have a Windows desktop app written in C# that loops through a bunch of XML files stored on disk and created by a 3rd party program. Almost all of the files are loaded and processed successfully by the LINQ code that follows this statement:

XDocument xmlDoc = XDocument.Load(inFileName);
List<DocMetaData> docList =
    (from d in xmlDoc.Descendants("DOCUMENT")
     select new DocMetaData
     {
         File = d.Element("FILE").SafeGetAttributeValue("filename"),
         Folder = d.Element("FOLDER").SafeGetAttributeValue("name"),
         ItemID = d.Elements("INDEX")
             .Where(i => (string)i.Attribute("name") == "Item ID(idmId)")
             .Select(i => (string)i.Attribute("value"))
             .FirstOrDefault(),
         Comment = d.Elements("INDEX")
             .Where(i => (string)i.Attribute("name") == "Comment(idmComment)")
             .Select(i => (string)i.Attribute("value"))
             .FirstOrDefault(),
         Title = d.Elements("INDEX")
             .Where(i => (string)i.Attribute("name") == "Title(idmName)")
             .Select(i => (string)i.Attribute("value"))
             .FirstOrDefault(),
         DocClass = d.Elements("INDEX")
             .Where(i => (string)i.Attribute("name") == "Document Class(idmDocType)")
             .Select(i => (string)i.Attribute("value"))
             .FirstOrDefault()
     }).ToList();

...where inFileName is a full path and filename such as:

     Y:\S2Out\B0000004\Pet Tab\convert.B0000004.Pet Tab.xml

But a few of the files cause problems like this:

System.Xml.XmlException: Invalid character in the given encoding. Line 52327, position 126.
at System.Xml.XmlTextReaderImpl.Throw(Exception e)
at System.Xml.XmlTextReaderImpl.Throw(String res, String arg)
at System.Xml.XmlTextReaderImpl.InvalidCharRecovery(Int32& bytesCount, Int32& charsCount)
at System.Xml.XmlTextReaderImpl.GetChars(Int32 maxCharsCount)
at System.Xml.XmlTextReaderImpl.ReadData()
at System.Xml.XmlTextReaderImpl.ParseAttributeValueSlow(Int32 curPos, Char quoteChar, NodeData attr)
at System.Xml.XmlTextReaderImpl.ParseAttributes()
at System.Xml.XmlTextReaderImpl.ParseElement()
at System.Xml.XmlTextReaderImpl.ParseElementContent()
at System.Xml.XmlTextReaderImpl.Read()
at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r)
at System.Xml.Linq.XContainer.ReadContentFrom(XmlReader r, LoadOptions o)
at System.Xml.Linq.XDocument.Load(XmlReader reader, LoadOptions options)
at System.Xml.Linq.XDocument.Load(String uri, LoadOptions options)
at System.Xml.Linq.XDocument.Load(String uri)
at CBMI.WinFormsUI.GridForm.processFile(StreamWriter oWriter, String inFileName, Int32 XMLfileNumber) in C:\ProjectsVS2010\CBMI.LatitudePostConverter\CBMI.LatitudePostConverter\CBMI.WinFormsUI\GridForm.cs:line 147
at CBMI.WinFormsUI.GridForm.btnProcess_Click(Object sender, EventArgs e) in C:\ProjectsVS2010\CBMI.LatitudePostConverter\CBMI.LatitudePostConverter\CBMI.WinFormsUI\GridForm.cs:line 105

The XML files look like this (this sample shows only 2 DOCUMENT elements but there are many):

<?xml version="1.0" ?>
<DOCUMENTCOLLECTION>
   <DOCUMENT>
       <FILE filename="e:\S2Out\B0000005\General\D003712420.0001.pdf" outputpath="e:\S2Out\B0000005\General"/>
       <ANNOTATION filename=""/>
       <INDEX name="Comment(idmComment)" value=""/>
       <INDEX name="Document Class(idmDocType)" value="General"/>
       <INDEX name="Item ID(idmId)" value="003712420"/>
       <INDEX name="Original File Name(idmDocOriginalFile)" value="Matrix Aligning 603.24 Criteria to Petition Pages.pdf"/>
       <INDEX name="Title(idmName)" value="Matrix for 603.24"/>
       <FOLDER name="/Accreditation/PASBVE/2004-06"/>
   </DOCUMENT>
   <DOCUMENT>
       <FILE filename="e:\S2Out\B0000005\General\D003712442.0001.pdf" outputpath="e:\S2Out\B0000005\General"/>
       <ANNOTATION filename=""/>
       <INDEX name="Comment(idmComment)" value=""/>
       <INDEX name="Document Class(idmDocType)" value="General"/>
       <INDEX name="Item ID(idmId)" value="003712442"/>
       <INDEX name="Original File Name(idmDocOriginalFile)" value="Contacts at NDU.pdf"/>
       <INDEX name="Title(idmName)" value="Contacts at NDU"/>
       <FOLDER name="/Accreditation/NDU/2006-12/Self-Study"/>
   </DOCUMENT>

The LINQ statements have their own complexities, but I think they work OK; it is the Load call that fails. I have looked at the various overloads of XDocument.Load and researched other questions about this exception, but I am confused about how to prevent it.

Lastly, the exception points at line 52327, position 126 of the file that failed to load, but the data on line 52327 does not look like it should cause a problem (and the last character on that line is at position 103!):

<FILE filename="e:\S2Out\B0000004\Pet Tab\D003710954.0001.pdf" outputpath="e:\S2Out\B0000004\Pet Tab"/>
Akshay Soam
John Adams
  • Can you include line 52327 of the file that failed so that we can see what the content is that caused the exception? – competent_tech Nov 26 '11 at 02:25
  • Just added it. Makes no sense to me. – John Adams Nov 26 '11 at 02:26
  • Please post actual XML that will _actually_ cause the problem. – John Saunders Nov 26 '11 at 02:29
  • How about the next or previous line? Do they have the appropriate number of chars? Also, you might try opening the file in an editor (if you aren't already) that can display at least placeholders for invalid chars (e.g. NoteTab Pro, which I only suggest because it's the only one I know). – competent_tech Nov 26 '11 at 02:30
  • Most popular web browsers will validate your XML and show you exactly where the invalid content is found. – phatfingers Nov 26 '11 at 02:56

4 Answers

47

In order to control the encoding (once you know what it is), you can load the files using the Load method overload that accepts a TextReader.

Then you can create a new StreamReader against your file specifying the appropriate Encoding in the constructor.

For example, to open the file using Western European encoding, replace the following line of code in the question:

XDocument xmlDoc = XDocument.Load(inFileName);

with this code:

XDocument xmlDoc = null;

using (StreamReader oReader = new StreamReader(inFileName, Encoding.GetEncoding("ISO-8859-1"))) {
    xmlDoc = XDocument.Load(oReader);
}

The list of supported encodings can be found in the MSDN documentation.
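If only a few files are affected, one option is to keep the default load and fall back to an explicit encoding only when parsing fails. The following is a minimal sketch under assumptions: the class name XmlLoadHelper and the method LoadWithFallback are hypothetical, and ISO-8859-1 is only a guess at what the third-party program actually wrote.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class XmlLoadHelper
{
    // Hypothetical helper: try the default load first, then retry with an
    // explicitly chosen encoding if the parser rejects the file.
    public static XDocument LoadWithFallback(string path, string fallbackEncodingName = "ISO-8859-1")
    {
        try
        {
            // Honors a BOM or an encoding declaration; otherwise assumes UTF-8.
            return XDocument.Load(path);
        }
        catch (XmlException)
        {
            // Assumption: the failure was an encoding problem. A file that is
            // simply malformed will throw again from the second load.
            Encoding fallback = Encoding.GetEncoding(fallbackEncodingName);
            using (var reader = new StreamReader(path, fallback))
            {
                return XDocument.Load(reader);
            }
        }
    }
}
```

With this in place, the line from the question would become something like XDocument xmlDoc = XmlLoadHelper.LoadWithFallback(inFileName);. Whether the decoded text is actually correct still depends on picking the right fallback encoding.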

competent_tech
2

Because XmlDocument loads the entire thing in one go, it aborts the whole process as soon as it runs into a character it cannot decode. If you want to process what you can and skip/log duff bits, look at XmlTextReader. An XmlTextReader created over a FileStream reads one node at a time, so it will also use a lot less memory. You could even get clever and split the thing up and parallelise the processing.
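A node-at-a-time read along these lines might look like the sketch below. It uses XmlReader.Create rather than constructing XmlTextReader directly, and the names DocumentStreamer and ProcessDocuments are hypothetical. One caveat: once the reader throws it cannot resume, so this handles the DOCUMENT elements that parse cleanly before the failure and then reports the error, rather than truly skipping the bad entry. The reported position may also be approximate, since the reader decodes the file in buffered chunks.

```csharp
using System;
using System.Xml;
using System.Xml.Linq;

public static class DocumentStreamer
{
    // Hands each DOCUMENT element to onDocument as it is read.
    // Returns how many were processed; on a parse or encoding error it
    // passes the exception to onError and stops.
    public static int ProcessDocuments(string path, Action<XElement> onDocument, Action<XmlException> onError)
    {
        int count = 0;
        using (XmlReader reader = XmlReader.Create(path))
        {
            try
            {
                reader.MoveToContent();
                while (!reader.EOF)
                {
                    if (reader.NodeType == XmlNodeType.Element && reader.Name == "DOCUMENT")
                    {
                        // ReadFrom consumes the whole element and leaves the
                        // reader positioned on the node that follows it.
                        var document = (XElement)XNode.ReadFrom(reader);
                        onDocument(document);
                        count++;
                    }
                    else
                    {
                        reader.Read();
                    }
                }
            }
            catch (XmlException ex)
            {
                onError(ex);
            }
        }
        return count;
    }
}
```

Each XElement passed to onDocument can be queried with the same Element and Elements calls used in the question, and ex.LineNumber and ex.LinePosition give a starting point for finding the bad bytes.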

When I've had this it's been things like accented characters in there: grave, acutes, umlauts, and such.

I don't have any automated processes, so usually I just load the file in Visual Studio and edit the bad guys out until there are no squigglies left. The theory is sound though.

dcsohl
Tony Hopkinson
2

The referenced file contains a character that is valid for a filename, but invalid in an XML attribute. You have a few options.

  1. You could change the filename and rerun your third-party script.
  2. You could work with the vendor to provide a patch that safely encodes the offending characters.
  3. You could pre-validate the XML documents and remove the offending entries prior to processing.
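For option 3, a pre-validation pass could be as small as reading each file to the end and recording any failure before the real processing starts. This is only a sketch: XmlPreValidator and TryValidate are hypothetical names, and it checks well-formedness and decodability only, not any schema.

```csharp
using System.Xml;

public static class XmlPreValidator
{
    // Returns true if the file can be read to the end without an
    // XmlException; otherwise returns false and describes the failure.
    public static bool TryValidate(string path, out string error)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create(path))
            {
                while (reader.Read())
                {
                    // Nothing to do: reading every node is the check.
                }
            }
            error = null;
            return true;
        }
        catch (XmlException ex)
        {
            error = string.Format("{0} [line {1}, position {2}]", ex.Message, ex.LineNumber, ex.LinePosition);
            return false;
        }
    }
}
```

Files that fail the check can be logged and set aside for repair or for the vendor, while the rest go through the normal load.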
phatfingers
  • Option 2 is the high road. The vendor that wrote the software to produce the XML documents should be providing valid XML. Their bug is likely affecting not only you, but other customers. – phatfingers Nov 26 '11 at 02:51
2

Not sure if this is your case, but this can be related to invalid byte sequences for a given encoding. Example: http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences.

Try filtering invalid sequences from the file while loading.
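One way to do that filtering, sketched here under the assumption that the files are meant to be UTF-8, is to decode through a StreamReader whose decoder substitutes U+FFFD for bytes it cannot decode instead of throwing. LenientXmlLoader and LoadReplacingInvalidBytes are hypothetical names. Note that this loses information: the original character is replaced, not recovered.

```csharp
using System.IO;
using System.Text;
using System.Xml.Linq;

public static class LenientXmlLoader
{
    public static XDocument LoadReplacingInvalidBytes(string path)
    {
        // UTF-8 with a replacement fallback: any byte sequence that is not
        // valid UTF-8 is decoded as U+FFFD rather than raising an exception.
        Encoding lenientUtf8 = Encoding.GetEncoding(
            "utf-8",
            new EncoderReplacementFallback("?"),
            new DecoderReplacementFallback("\uFFFD"));

        using (var reader = new StreamReader(path, lenientUtf8))
        {
            return XDocument.Load(reader);
        }
    }
}
```

After loading, attribute values containing '\uFFFD' mark the entries whose original bytes were invalid, which makes them easy to log or review.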

Igor S.