Parsing a large XML file to multiple output xmls, using XmlReader - getting every other element

Question

I need to take a very large XML file and create multiple output xml files from what could be thousands of repeating nodes of the input file. There is no whitespace in the source file "AnimalBatch.xml" which looks like this:

<?xml version="1.0" encoding="utf-8" ?><Animals><Animal id="1001"><Quantity>One</Quantity><Adjective>Red</Adjective><Name>Rooster</Name></Animal><Animal id="1002"><Quantity>Two</Quantity><Adjective>Stubborn</Adjective><Name>Donkeys</Name></Animal><Animal id="1003"><Quantity>Three</Quantity><Adjective>Blind</Adjective><Name>Mice</Name></Animal><Animal id="1004"><Quantity>Four</Quantity><Adjective>Purple</Adjective><Name>Horses</Name></Animal><Animal id="1005"><Quantity>Five</Quantity><Adjective>Long</Adjective><Name>Centipedes</Name></Animal><Animal id="1006"><Quantity>Six</Quantity><Adjective>Dark</Adjective><Name>Owls</Name></Animal></Animals>

The program needs to split the repeating "Animal" and produce the appropriate number of files named: Animal_1001.xml, Animal_1002.xml, Animal_1003.xml, etc.

Animal_1001.xml:
<?xml version="1.0" encoding="utf-8"?>
<Animal>
<Quantity>One</Quantity>
<Adjective>Red</Adjective>
<Name>Rooster</Name>
</Animal>

Animal_1002.xml:
<?xml version="1.0" encoding="utf-8"?>
<Animal>
<Quantity>Two</Quantity>
<Adjective>Stubborn</Adjective>
<Name>Donkeys</Name>
</Animal>

Animal_1003.xml:
<?xml version="1.0" encoding="utf-8"?>
<Animal>
<Quantity>Three</Quantity>
<Adjective>Blind</Adjective>
<Name>Mice</Name>
</Animal>

The code below works, but only if the input file has a CR/LF after each <Animal id="xxxx"> element. If the file has no whitespace (mine doesn't, and I can't get it delivered that way), I get only every other element (the odd-numbered animals).

    static void SplitXMLReader()
    {
        string strFileName;
        string strSeq = "";

        XmlReader doc = XmlReader.Create("C:\\AnimalBatch.xml");

        while (doc.Read())
        {
            if ( doc.Name == "Animal"  && doc.NodeType == XmlNodeType.Element )
            {
                strSeq = doc.GetAttribute("id"); 

                XmlDocument outdoc = new XmlDocument();
                XmlDeclaration xmlDeclaration = outdoc.CreateXmlDeclaration("1.0", "utf-8", null);                     
                XmlElement rootNode = outdoc.CreateElement(doc.Name);

                rootNode.InnerXml = doc.ReadInnerXml();  
                // This seems to be advancing the cursor in doc too far.

                outdoc.InsertBefore(xmlDeclaration, outdoc.DocumentElement);
                outdoc.AppendChild(rootNode);

                strFileName = "Animal_" + strSeq + ".xml";
                outdoc.Save("C:\\" + strFileName);                    
            }
        }
    }

My understanding is that "whitespace" or formatting in XML should make no difference to XmlReader - but I've tried this both ways, with and without CR/LF's after the <Animal id="xxxx">, and can confirm there is a difference. If it has CR/LFs (possibly even just a space, which I'll try next) - it gets each <Animal> node processed fully, and saved under the right filename that comes from the id attribute.

Can someone let me know what's going on here - and a possible workaround?

followup? It is more like making SO people write OP's job step by step without his showing any real effort other than asking question — L.B, Aug 30 '12 at 00:44
What I am finding is that the whitespace IS significant. I expanded the sample to six elements to show the pattern of what I think is the problem - that the cursor on the input file is positioned just past the beginning of the next element. My prior question had CR/LFs after each element. Turns out I won't have those - and that's a restriction I can't control. Perhaps I have to use the XmlTextReader in this situation? — Rick Bellows, Aug 30 '12 at 00:44
No real effort?! I've been working on this on my own for a while now - and am trying to do it myself in several different ways. Only when I found that John's comment of "whitespace doesn't matter" isn't true - am I asking for help - and showing that that is not the case. — Rick Bellows, Aug 30 '12 at 00:50
Also - the prior questions were specifically: How to break it apart using XmlDocument, How to do it with XmlTextReader, and this one, How to do it with XmlReader. I'll admit I was mistaken about what the input file would look like - but it does make a difference. The next one I was going to do was how to do this with XDocument. That would cover four substantially different approaches to the same basic - and, I'm sure, pretty common - problem. — Rick Bellows, Aug 30 '12 at 01:00

score 0 · Accepted Answer · answered Aug 30 '12 at 01:00

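Yes, when you use doc.ReadInnerXml(), whitespace matters.

From the documentation of the function: it returns the content as a string and leaves the reader positioned on the node after the element's end tag. With no whitespace between elements, that node is the next <Animal>, and the doc.Read() in the while condition then steps past it. If you want the inner content as an XmlNode, you should use something like this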

corn3lius

Thanks, and thanks to everyone else who has helped with this. I've also found a very relevant SO question that applies to this at: http://stackoverflow.com/questions/7196468/xmlreader-problem-reading-xml-file-with-no-newlines – Rick Bellows Aug 30 '12 at 02:11
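
To make the cursor behaviour concrete, here is a minimal sketch. The `SkipDemo` class, the `VisitedIds` method, and the `readAfterConsume` flag are hypothetical names for illustration, not part of the original posts. It assumes the cause suspected in the question: `ReadInnerXml()` leaves the reader on the node that follows `</Animal>`, so an unconditional `Read()` afterwards steps over whatever that node is.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SkipDemo
{
    // Hypothetical helper: returns the ids of the <Animal> elements the loop visits.
    // readAfterConsume = true mimics the question's `while (doc.Read())` loop;
    // readAfterConsume = false advances only when nothing was consumed.
    public static List<string> VisitedIds(string xml, bool readAfterConsume)
    {
        var ids = new List<string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.Read();
            while (!reader.EOF)
            {
                if (reader.NodeType == XmlNodeType.Element && reader.Name == "Animal")
                {
                    ids.Add(reader.GetAttribute("id"));

                    // After this call the reader sits on the node AFTER </Animal>.
                    reader.ReadInnerXml();

                    // With no whitespace, that node is the next <Animal>,
                    // and this extra Read() moves past its start tag.
                    if (readAfterConsume)
                    {
                        reader.Read();
                    }
                }
                else
                {
                    reader.Read();
                }
            }
        }
        return ids;
    }
}
```

For an input with ids 1001 to 1003 and no whitespace between elements, `SkipDemo.VisitedIds(xml, true)` should visit only 1001 and 1003, while `SkipDemo.VisitedIds(xml, false)` should visit all three. When a line break separates the elements, the extra `Read()` consumes the whitespace node instead, which is why the formatted file appeared to work.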

score 0 · Answer 2 · answered Aug 30 '12 at 04:18

Thanks for the guidance on using the ReadSubtree() method:

This code works for the XML input file with no linefeeds:

    static void SplitXMLReaderSubTree()
    {
        string strFileName;
        string strSeq = "";

        using (XmlReader doc = XmlReader.Create("C:\\AnimalBatch.xml"))
        {
            while (!doc.EOF)
            {
                if (doc.Name == "Animal" && doc.NodeType == XmlNodeType.Element)
                {
                    strSeq = doc.GetAttribute("id");

                    // The subtree reader only sees this <Animal> element, so
                    // reading from it cannot run past the end tag in doc.
                    XmlReader inner = doc.ReadSubtree();
                    inner.Read();

                    XmlDocument outdoc = new XmlDocument();
                    XmlDeclaration xmlDeclaration = outdoc.CreateXmlDeclaration("1.0", "utf-8", null);
                    // CreateElement takes only the name, so the id attribute is not carried over.
                    XmlElement myElement = outdoc.CreateElement("Animal");
                    myElement.InnerXml = inner.ReadInnerXml();
                    // Closing inner leaves doc on </Animal>; the else branch then moves on.
                    inner.Close();

                    outdoc.InsertBefore(xmlDeclaration, outdoc.DocumentElement);
                    outdoc.AppendChild(myElement);

                    strFileName = "Animal_" + strSeq + ".xml";
                    outdoc.Save("C:\\" + strFileName);
                }
                else
                {
                    doc.Read();
                }
            }
        }
    }
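
A variation on the code above, offered as a sketch rather than a drop-in replacement. It assumes the same input shape and returns each document as a string instead of saving to `C:\`, so it can be tried in memory. `SubtreeSplitter` and `Split` are hypothetical names. It loads the subtree reader straight into an `XmlDocument` and then strips the `id` attribute to match the expected output.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class SubtreeSplitter
{
    // Hypothetical helper: maps each Animal id to its standalone XML text.
    public static Dictionary<string, string> Split(string xml)
    {
        var result = new Dictionary<string, string>();
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            // ReadToFollowing advances until it reaches an <Animal> start tag.
            while (reader.ReadToFollowing("Animal"))
            {
                string id = reader.GetAttribute("id");
                var outdoc = new XmlDocument();

                // The subtree reader is confined to this one element.
                using (XmlReader inner = reader.ReadSubtree())
                {
                    outdoc.Load(inner);
                }
                // Disposing inner leaves the outer reader on </Animal>,
                // so the next ReadToFollowing cannot skip a sibling.

                // Drop the id attribute, as in the expected output files.
                outdoc.DocumentElement.RemoveAllAttributes();
                result[id] = outdoc.OuterXml;
            }
        }
        return result;
    }
}
```

To write files as in the question, each entry could be saved under a name such as `"Animal_" + id + ".xml"`. Only one `<Animal>` is held in memory at a time during parsing, which is the point of using a reader on a very large input, although this sketch does keep every result in the dictionary.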
