How to delete an invalid XML node from XML document when one of its attributes contains invalid data

Question

Consider the following XML doc:

<?xml version="1.0" encoding="iso-8859-1" ?>
<a>
    <b>
        <c1 description="abc123" /> 
        <c2 description="bbbasdasdbc123" /> 
        <c3 description="cccbasdasdc123" /> 
    </b>
    <b>
        <c1 description="abc123" /> 
        <c2 description="bbbasdasdbc123" /> 
        <c3 description="cccbasdasdc123" /> 
        <c4 description="abc123"" />    
        <c5 description="bbbasdasdbc123" /> 
        <c6 description="cccbasdasdc123" /> 
    </b>
    <b>
        <c1 description="abcaslkjkl123" weight="10" />
    </b>
</a>

As it stands, this XML doc is not well-formed, and Firefox points to the offending position: Line 12 col 27, i.e. the extra double quote. The double quote itself isn't the issue here; the cause of the error could be anything that makes the XML doc invalid.

The point is that when I try to load the XML doc an error occurs - from which I know the line number and column... - after which I have no choice but to flag the file as errored-do-something-with-it-later-on.

What I'd like to do is to delete the <b> node (or extract it for further error processing at a later date) that encapsulates the offending line

i.e. delete

    <b>
        <c1 description="abc123" /> 
        <c2 description="bbbasdasdbc123" /> 
        <c3 description="cccbasdasdc123" /> 
        <c4 description="abc123"" />    
        <c5 description="bbbasdasdbc123" /> 
        <c6 description="cccbasdasdc123" /> 
    </b>

leaving just

<?xml version="1.0" encoding="iso-8859-1" ?>
<a>
    <b>
        <c1 description="abc123" /> 
        <c2 description="bbbasdasdbc123" /> 
        <c3 description="cccbasdasdc123" /> 
    </b>
    <b>
        <c1 description="abcaslkjkl123" weight="10" />
    </b>
</a>

The XML file can be quite large (up to 100 MB).

I've investigated the following, which eventually led me to use File.ReadLines(sourceXMLFile).Take(...) and similar:

How to read a text file reversely with iterator in C#

Get last 10 lines of very large text file > 10GB

https://msdn.microsoft.com/en-us/library/w5aahf2a%28v=vs.110%29.aspx

Using a schema to validate the XML beforehand isn't an option (http://www.codeguru.com/csharp/csharp/cs_data/xml/article.php/c6737/Validation-of-XML-with-XSD.htm).

I've thought about ways to try and solve this knowing the offending line number and came up with this:

    public void ProcessXMLFile(string sourceXMLFile, string errorFile)
    {
        XmlDocument xmlDocument = new XmlDocument();

        string outputFile1 = @"c:\temp\f1.txt";
        string outputFile2 = @"c:\temp\f2.txt";

        string soughtOpeningNode = "<b>";
        string soughtClosingNode = "</b>";

        string firstPart = "";
        string secondPart = "";
        int lastNode = 0;
        int firstNode = 0;


        try
        {
            xmlDocument.Load(sourceXMLFile);
        }
        catch (XmlException ex)
        {
            int offendingLineNumber = ex.LineNumber;

            // Create the first part of the file that comprises everything up to and including the line that caused the error
            using (StreamWriter f1 = new StreamWriter(outputFile1))
            {
                firstPart = string.Join("\r\n", File.ReadLines(sourceXMLFile).Take(offendingLineNumber));
                f1.WriteLine(firstPart);
                lastNode = firstPart.LastIndexOf(soughtOpeningNode);
            }

            // Create the file that contains the remainder of the original file starting after the line number that caused the error
            using (StreamWriter f2 = new StreamWriter(outputFile2))
            {
                secondPart = string.Join("\r\n", File.ReadLines(sourceXMLFile).Skip(offendingLineNumber));
                f2.WriteLine(secondPart);
                firstNode = secondPart.IndexOf(soughtClosingNode);
            }

            // Create the XML file without the node whose child caused the error...
            using (StreamWriter d1 = new StreamWriter(sourceXMLFile))
            {
                d1.WriteLine(firstPart.Substring(0, lastNode));
                // Resume immediately after the closing </b> tag
                d1.WriteLine(secondPart.Substring(firstNode + soughtClosingNode.Length));
            }

            // Write the node that contained the offending line number for later processing
            using (StreamWriter d1 = new StreamWriter(errorFile, true))
            {
                d1.WriteLine(firstPart.Substring(lastNode));
                // Keep everything up to and including the closing </b> tag
                d1.WriteLine(secondPart.Substring(0, firstNode + soughtClosingNode.Length));
            }

            File.Delete(outputFile1);
            File.Delete(outputFile2);

            ProcessXMLFile(sourceXMLFile, errorFile);
        }
    }

And to kick off:

ProcessXMLFile(@"c:\temp\myBigFile.xml", @"c:\temp\myBigFile-errors.txt");

My questions then:

This works but are there better ways to do this?
When processing an XML file (approx. 70 MB) that contains many errors, it eventually runs out of memory (Task Manager shows memory usage creeping steadily upwards towards 99% on a 16 GB machine).
Even when I force the routine to finish, memory remains at 99% and only drops when VS2010 is stopped, so how can I make this more memory-efficient?

Pointers would be appreciated.

Sai.

Use a stream class like StreamReader with ReadLine() method to get to line that needs editing. — jdweng, Mar 03 '16 at 10:32
hi @jdweng. This won't do what I'm after ie how to cut out the node that contains the offending line. — err1, Mar 03 '16 at 10:54
Hi, yes, just spotted the logic error there. I was trying to be a smart alec. Will post up new code that works on the original file (the original file gets smaller whilst the results get bigger). Apologies for messing you around. — err1, Mar 03 '16 at 11:53

score 1 · Answer 1 · answered Mar 04 '16 at 10:18

This seems a dodgy thing to try and do. In general, if an XML file is not well-formed, you can't read it as an XML file. The line and column that appear in the error message don't necessarily tell you "this is the position of the error", it simply tells you at which point the XML parser couldn't make sense of the file and gave up.
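To see where the parser gives up on a given input, a small helper can be used. This is only a sketch: the class and method names (`XmlErrorLocator`, `FindFirstError`) are made up for illustration, and it reads the input with `XmlReader` rather than `XmlDocument` so that no DOM is built.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of the question's code.
public static class XmlErrorLocator
{
    // Returns "line:column" of the point where the parser stopped,
    // or null when the text is well-formed.
    public static string FindFirstError(string xml)
    {
        try
        {
            using (var reader = XmlReader.Create(new StringReader(xml)))
            {
                // Streaming parse: read every node and discard it.
                while (reader.Read()) { }
            }
            return null;
        }
        catch (XmlException ex)
        {
            // LineNumber and LinePosition describe where parsing failed,
            // which is not necessarily where the mistake was made.
            return ex.LineNumber + ":" + ex.LinePosition;
        }
    }
}
```

For example, if a `</b>` closing tag is missing, the position reported is likely to be at a later closing tag that fails to match, not at the place where `</b>` should have been.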

So at best, you are handling a subset of the possible errors in the XML file. It may be that in your case you know what kind of error you expect to see (for example, data within an element that is not being properly encoded), in which case it might make sense to try to strip out the enclosing element as you are doing, but it would still be better to get the code fixed that is creating the input file.

Now addressing your specific questions, your code seems to do it in a reasonable way, although if you know exactly what types of errors you expect (e.g. doubled-up quotes as in your example) maybe you could search the file for those specific things rather than repeatedly trying to parse it as XML and handling the resulting error.
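As a rough illustration of searching for one known error type, the sketch below scans lines for an attribute value followed by a stray second quote, as in the sample. The class name, method name and regular expression are assumptions for this example only; a real file may need a different pattern.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical scanner for one specific, known error type.
public static class KnownErrorScanner
{
    // Matches ="value"" : a quoted attribute value followed by an extra quote.
    private static readonly Regex DoubledQuote =
        new Regex("=\"[^\"]*\"\"", RegexOptions.Compiled);

    // Returns the 1-based numbers of the lines that contain the pattern.
    public static List<int> FindSuspectLines(IEnumerable<string> lines)
    {
        var hits = new List<int>();
        int lineNumber = 0;
        foreach (string line in lines)
        {
            lineNumber++;
            if (DoubledQuote.IsMatch(line))
            {
                hits.Add(lineNumber);
            }
        }
        return hits;
    }
}
```

Passing `File.ReadLines(sourceXMLFile)` as the argument keeps only one line in memory at a time, and a single pass finds every occurrence without re-parsing the file after each error.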

As far as memory use is concerned, do you still have a problem when you do a Release build and run it outside the debugger? I found memory use grew continually under the debugger, presumably because garbage collection isn't done as aggressively, but when I run a Release build it stays steady.
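One possible contributor, which I have not confirmed, is the recursion: each nested call to ProcessXMLFile still has firstPart and secondPart in scope, and in a Debug build locals are generally kept alive until the method returns, so every error could pin two file-sized strings. A loop avoids that. The following is a minimal sketch, not a drop-in replacement: the names are hypothetical, and it assumes each `<b>` and `</b>` tag sits on its own line, as in the sample.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Xml;

// Hypothetical loop-based variant of the question's approach.
public static class IterativeXmlCleaner
{
    // Returns the 1-based line where parsing failed, or 0 if well-formed.
    private static int FirstErrorLine(string path)
    {
        try
        {
            using (var stream = File.OpenRead(path))
            using (var reader = XmlReader.Create(stream))
            {
                while (reader.Read()) { }
            }
            return 0;
        }
        catch (XmlException ex)
        {
            return ex.LineNumber;
        }
    }

    public static void Clean(string sourceXmlFile, string errorFile)
    {
        int errorLine;
        while ((errorLine = FirstErrorLine(sourceXmlFile)) > 0)
        {
            // Scoped to this iteration, so it can be collected before the next pass.
            List<string> lines = File.ReadLines(sourceXmlFile).ToList();
            int errorIndex = Math.Min(errorLine, lines.Count) - 1;

            // Walk back to the nearest opening tag and forward to the nearest closing tag.
            int start = errorIndex;
            while (start >= 0 && !lines[start].Contains("<b>")) start--;
            int end = errorIndex;
            while (end < lines.Count && !lines[end].Contains("</b>")) end++;
            if (start < 0 || end >= lines.Count)
            {
                throw new InvalidDataException("No enclosing <b> block around line " + errorLine);
            }

            // Save the faulty block for later, then rewrite the source without it.
            int count = end - start + 1;
            File.AppendAllLines(errorFile, lines.Skip(start).Take(count));
            lines.RemoveRange(start, count);
            File.WriteAllLines(sourceXmlFile, lines);
        }
    }
}
```

Each pass removes at least one line or throws, so the loop terminates. It still re-reads the whole file once per error, so it addresses memory growth rather than speed.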

Hi David, thank you for your comments. Agreed that I shouldn't have to deal with ill-formed XML but alas, it's a situation that I must allow for. Glad to hear that my approach is sound. Oh, and I addressed the memory issue by putting the functionality into a class, invoking and disposing()... Now processing the 70 MB XML is fine. — err1, Mar 07 '16 at 20:19
