Consider the following XML doc:
<?xml version="1.0" encoding="iso-8859-1" ?>
<a>
<b>
<c1 description="abc123" />
<c2 description="bbbasdasdbc123" />
<c3 description="cccbasdasdc123" />
</b>
<b>
<c1 description="abc123" />
<c2 description="bbbasdasdbc123" />
<c3 description="cccbasdasdc123" />
<c4 description="abc123"" />
<c5 description="bbbasdasdbc123" />
<c6 description="cccbasdasdc123" />
</b>
<b>
<c1 description="abcaslkjkl123" weight="10" />
</b>
</a>
As it stands, this XML doc is invalid (not well-formed), and Firefox points to the offending line: line 12, col 27, i.e. the extra double quote. The double quote itself isn't the issue here; the cause could be anything that makes the XML doc invalid.
The point is that when I try to load the XML doc an error occurs, from which I know the line number and column, and after that I have no choice but to flag the file as errored-do-something-with-it-later-on.
What I'd like to do is delete the <b> node that encapsulates the offending line (or extract it for further error processing at a later date), i.e. delete
<b>
<c1 description="abc123" />
<c2 description="bbbasdasdbc123" />
<c3 description="cccbasdasdc123" />
<c4 description="abc123"" />
<c5 description="bbbasdasdbc123" />
<c6 description="cccbasdasdc123" />
</b>
leaving just
<?xml version="1.0" encoding="iso-8859-1" ?>
<a>
<b>
<c1 description="abc123" />
<c2 description="bbbasdasdbc123" />
<c3 description="cccbasdasdc123" />
</b>
<b>
<c1 description="abcaslkjkl123" weight="10" />
</b>
</a>
The XML can be quite large (up to 100 MB).
I've investigated the following, which eventually led me to use File.ReadLines(sourceXMLFile).Take(...) etc.:
- How to read a text file reversely with iterator in C#
- Get last 10 lines of very large text file > 10GB
- https://msdn.microsoft.com/en-us/library/w5aahf2a%28v=vs.110%29.aspx
Using a schema to validate the XML beforehand isn't an option (http://www.codeguru.com/csharp/csharp/cs_data/xml/article.php/c6737/Validation-of-XML-with-XSD.htm).
I've thought about ways to try and solve this knowing the offending line number and came up with this:
public void ProcessXMLFile(string sourceXMLFile, string errorFile)
{
    XmlDocument xmlDocument = new XmlDocument();
    string outputFile1 = @"c:\temp\f1.txt";
    string outputFile2 = @"c:\temp\f2.txt";
    string soughtOpeningNode = "<b>";
    string soughtClosingNode = "</b>";
    string firstPart = "";
    string secondPart = "";
    int lastNode = 0;
    int firstNode = 0;
    try
    {
        xmlDocument.Load(sourceXMLFile);
    }
    catch (XmlException ex)
    {
        int offendingLineNumber = ex.LineNumber;
        // Create the first part of the file that comprises everything up to and including the line that caused the error
        using (StreamWriter f1 = new StreamWriter(outputFile1))
        {
            firstPart = string.Join("\r\n", File.ReadLines(sourceXMLFile).Take(offendingLineNumber));
            f1.WriteLine(firstPart);
            lastNode = firstPart.LastIndexOf(soughtOpeningNode);
        }
        // Create the file that contains the remainder of the original file starting after the line number that caused the error
        using (StreamWriter f2 = new StreamWriter(outputFile2))
        {
            secondPart = string.Join("\r\n", File.ReadLines(sourceXMLFile).Skip(offendingLineNumber));
            f2.WriteLine(secondPart);
            firstNode = secondPart.IndexOf(soughtClosingNode);
        }
        // Create the XML file without the node whose child caused the error...
        using (StreamWriter d1 = new StreamWriter(sourceXMLFile))
        {
            d1.WriteLine(firstPart.Substring(0, lastNode));
            // Resume just past the closing tag of the removed node
            d1.WriteLine(secondPart.Substring(firstNode + soughtClosingNode.Length));
        }
        // Write the node that contained the offending line number for later processing
        using (StreamWriter d1 = new StreamWriter(errorFile, true))
        {
            d1.WriteLine(firstPart.Substring(lastNode));
            d1.WriteLine(secondPart.Substring(0, firstNode + soughtClosingNode.Length + 1));
        }
        File.Delete(outputFile1);
        File.Delete(outputFile2);
        // Recurse to find and remove the next error, if any
        ProcessXMLFile(sourceXMLFile, errorFile);
    }
}
And to kick off:
ProcessXMLFile(@"c:\temp\myBigFile.xml", @"c:\temp\myBigFile-errors.txt");
My questions then:
- This works, but are there better ways to do this?
- When processing an XML file (approx. 70 MB) that contains many errors, it eventually runs out of memory (Task Manager shows memory usage creeping ever upwards towards 99% on a 16 GB machine).
- Even when I force the routine to finish, memory remains at 99% and only drops when VS2010 is stopped. How can I make this more memory-efficient?
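One direction that might avoid both the recursion and the whole-file strings is a single streaming pass: buffer one <b> block at a time, check that block on its own with XmlReader, and write it either to the cleaned output or to the error file. The following is only a rough sketch under assumptions: it assumes each <b> and </b> tag sits on its own line and that <b> blocks are not nested, as in the sample above, and it does not handle errors outside a <b> block. The names BlockFilter, IsWellFormed and SplitBlocks are hypothetical.

```csharp
using System;
using System.IO;
using System.Text;
using System.Xml;

public static class BlockFilter
{
    // Returns true if the fragment parses as well-formed XML on its own.
    public static bool IsWellFormed(string fragment)
    {
        try
        {
            using (XmlReader reader = XmlReader.Create(new StringReader(fragment)))
            {
                while (reader.Read()) { }
            }
            return true;
        }
        catch (XmlException)
        {
            return false;
        }
    }

    // Copies input to good, diverting every malformed <b>...</b> block to bad.
    // Assumption: <b> and </b> each sit on their own line and are not nested.
    public static void SplitBlocks(TextReader input, TextWriter good, TextWriter bad)
    {
        StringBuilder block = null; // non-null while inside a <b> block
        string line;
        while ((line = input.ReadLine()) != null)
        {
            string trimmed = line.Trim();
            if (block == null)
            {
                if (trimmed == "<b>")
                {
                    block = new StringBuilder();
                    block.AppendLine(line);
                }
                else
                {
                    good.WriteLine(line); // declaration, <a>, </a>, etc.
                }
            }
            else
            {
                block.AppendLine(line);
                if (trimmed == "</b>")
                {
                    string text = block.ToString();
                    (IsWellFormed(text) ? good : bad).Write(text);
                    block = null;
                }
            }
        }
        // A block still open at end of file is treated as an error.
        if (block != null)
        {
            bad.Write(block.ToString());
        }
    }
}
```

With real files, input would presumably be a StreamReader opened with Encoding.GetEncoding("iso-8859-1") and good would be a StreamWriter on a new file rather than the source file itself. Only one block is held in memory at a time, so memory use should depend on the largest <b> block rather than on the file size or the number of errors.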
Pointers would be appreciated.
Sai.