5

I have a malformed XML file: the root element is never closed, because its final closing tag is missing.

When I try to load this file in C#:

StreamReader sr = new StreamReader(path);
batchFile = XDocument.Load(sr); // Exception

I get an exception "Unexpected end of file has occurred. The following elements are not closed: batch. Line 54, position 1."

Is it possible to ignore the missing closing tag or to force the load? I noticed that all my XML tools (like XML Notepad) automatically fix or ignore the problem. I cannot fix the XML file myself: it comes from third-party software, and sometimes the file is correct.

Bastien Vandamme

3 Answers

4

You can't do this with XDocument, because that class loads the whole document into memory and parses it completely, so it fails on the missing tag.
It is possible with XmlReader, though: it lets you read and process the document node by node, and you only get the missing-tag exception at the very end.
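As a rough sketch of that idea (not a tested solution for the asker's files), the loop below copies every node that XmlReader manages to deliver into an XmlWriter and stops at the first XmlException. Disposing the writer then closes any elements that are still open, so the result can be parsed as a normal XDocument. The class name `TruncatedXml` and the method `LoadTolerant` are hypothetical names made up for this example, and whitespace, declarations and processing instructions are deliberately dropped to keep it short.

```csharp
using System.IO;
using System.Text;
using System.Xml;
using System.Xml.Linq;

public static class TruncatedXml
{
    // Hypothetical helper: keeps everything the reader could parse
    // before it hit the error, and closes the elements left open.
    public static XDocument LoadTolerant(TextReader source)
    {
        var buffer = new StringBuilder();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };

        using (XmlReader reader = XmlReader.Create(source))
        using (XmlWriter writer = XmlWriter.Create(buffer, settings))
        {
            while (true)
            {
                try
                {
                    if (!reader.Read()) break;   // clean end of input
                }
                catch (XmlException)
                {
                    break;                       // truncated or malformed: keep what we have
                }

                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        writer.WriteStartElement(reader.Prefix, reader.LocalName, reader.NamespaceURI);
                        writer.WriteAttributes(reader, false);
                        if (reader.IsEmptyElement) writer.WriteEndElement();
                        break;
                    case XmlNodeType.EndElement:
                        writer.WriteFullEndElement();
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(reader.Value);
                        break;
                    case XmlNodeType.CDATA:
                        writer.WriteCData(reader.Value);
                        break;
                    case XmlNodeType.Comment:
                        writer.WriteComment(reader.Value);
                        break;
                    // Other node types are skipped in this sketch.
                }
            }
        } // disposing the writer closes the elements that are still open

        return XDocument.Parse(buffer.ToString());
    }
}
```

With the code from the question this would be called as `batchFile = TruncatedXml.LoadTolerant(sr);`. Note that if the reader fails before it has seen the root element, the buffer stays empty and `XDocument.Parse` throws anyway.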

Anton Semenov
3

I suggest using Tidy.NET to clean up messy input.

Tidy.NET has a nice API to get a list of problems (MessageCollection) in your 'XML', and you can use it to fix the text stream in memory. The simplest thing would be to fix one error at a time, though that will not perform too well with many errors. Otherwise, you might fix errors in reverse document order so that the offsets of the messages stay valid while you apply the fixes.

Here is an example to convert HTML input into XHTML:

Tidy tidy = new Tidy();

/* Set the options you want */
tidy.Options.DocType = DocType.Strict;
tidy.Options.DropFontTags = true;
tidy.Options.LogicalEmphasis = true;
tidy.Options.Xhtml = true;
tidy.Options.XmlOut = true;
tidy.Options.MakeClean = true;
tidy.Options.TidyMark = false;

/* Declare the parameters that are needed */
TidyMessageCollection tmc = new TidyMessageCollection();
MemoryStream input = new MemoryStream();
MemoryStream output = new MemoryStream();

byte[] byteArray = Encoding.UTF8.GetBytes("Put your HTML here...");
input.Write(byteArray, 0, byteArray.Length);
input.Position = 0;
tidy.Parse(input, output, tmc);

string result = Encoding.UTF8.GetString(output.ToArray());
sehe
  • Adding a sample snippet to convert HTML -> XHTML – sehe Apr 18 '11 at 10:14
  • I've not got this working well with XML. Unless I'm missing something Tidy.NET wasn't designed for XML. – Gareth A. Lloyd Oct 07 '14 at 15:55
  • Yes. Tidy is intended to sanitize wonky HTML. Because XHTML exists it could be worth a try. – sehe Oct 07 '14 at 15:58
  • I got as far as `tidy.Options.XmlOut = true; tidy.Options.TidyMark = false; tidy.Options.XmlTags = true;` But Tidy.NET crashes in the guts of PPrint.cs. I'm still looking at this approach. – Gareth A. Lloyd Oct 07 '14 at 16:38
1

What you could do is add the closing tag to the XML in memory and then load it.

So after reading the XML through the StreamReader, manipulate the text before you do the XML load.
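A minimal sketch of this approach, assuming the only defect is the missing closing tag of the root element (`batch`, going by the exception message in the question). `BatchLoader` and `ParseWithFallback` are hypothetical names invented for this example.

```csharp
using System.Xml;
using System.Xml.Linq;

public static class BatchLoader
{
    // Hypothetical helper: parse the text as-is first, and only if that
    // fails retry once with the root's closing tag appended.
    public static XDocument ParseWithFallback(string xml, string rootName)
    {
        try
        {
            return XDocument.Parse(xml);
        }
        catch (XmlException)
        {
            // Assumption: the closing tag of the root is the only thing missing.
            // If the file is broken in some other way, this second parse throws too.
            return XDocument.Parse(xml + "</" + rootName + ">");
        }
    }
}
```

It could be called as `batchFile = BatchLoader.ParseWithFallback(File.ReadAllText(path), "batch");`. Trying the unmodified text first matters here because the third-party software sometimes produces a correct file.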

Ivo