I need to split apart a large XML file to multiple output xmls, using XmlTextReader

Question

I need to take an XML file and create multiple output xml files from what could be thousands of repeating nodes of the input file. The source file "AnimalBatch.xml" looks like this:

<?xml version="1.0" encoding="utf-8" ?>
<Animals>
<Animal id="1001">
<Quantity>One</Quantity>
<Adjective>Red</Adjective>
<Name>Rooster</Name>
</Animal>
<Animal id="1002">
<Quantity>Two</Quantity>
<Adjective>Stubborn</Adjective>
<Name>Donkeys</Name>
</Animal>
<Animal id="1003">
<Quantity>Three</Quantity>
<Color>Blind</Color>
<Name>Mice</Name>
</Animal>
</Animals>

But in actuality, there are no CR/LF characters in it. The actual stream of text looks like this:

<?xml version="1.0" encoding="utf-8" ?><Animals><Animal id="1001"><Quantity>One</Quantity><Adjective>Red</Adjective><Name>Rooster</Name></Animal><Animal id="1002"><Quantity>Two</Quantity><Adjective>Stubborn</Adjective><Name>Donkeys</Name></Animal><Animal id="1003"><Quantity>Three</Quantity><Color>Blind</Color><Name>Mice</Name></Animal></Animals>

The program needs to split the repeating "Animal" and produce 3 files named: Animal_1001.xml, Animal_1002.xml, and Animal_1003.xml

I had a prior question on this using XmlDocument, which was already answered.
See: [Splitting XML file into multiple xml using XmlDocument][1]

This question is about how to use XmlReader to grab the elements and create XmlDocument elements from them.

Animal_1001.xml:
<?xml version="1.0" encoding="utf-8"?>
<Animal>
<Quantity>One</Quantity>
<Adjective>Red</Adjective>
<Name>Rooster</Name>
</Animal>

Animal_1002.xml:
<?xml version="1.0" encoding="utf-8"?>
<Animal>
<Quantity>Two</Quantity>
<Adjective>Stubborn</Adjective>
<Name>Donkeys</Name>
</Animal>

Animal_1003.xml:
<?xml version="1.0" encoding="utf-8"?>
<Animal>
<Quantity>Three</Quantity>
<Color>Blind</Color>
<Name>Mice</Name>
</Animal>

Here is the code that works, but only when there are line breaks in the input file:

    static void SplitXMLReader() 
    {
        string strFileName;
        string strSeq;

        XmlReader doc = XmlReader.Create("C:\\AnimalBatch.xml");

        while (doc.Read())
        {
            if (doc.Name=="Animal")
            {
                strSeq = doc.GetAttribute("id");

                XmlDocument outdoc = new XmlDocument();
                XmlDeclaration xmlDeclaration = outdoc.CreateXmlDeclaration("1.0", "utf-8", null);
                XmlElement rootNode = outdoc.CreateElement(doc.Name);

                rootNode.InnerXml = doc.ReadInnerXml();
                outdoc.InsertBefore(xmlDeclaration, outdoc.DocumentElement);
                outdoc.AppendChild(rootNode);

                strFileName = "Animal_" + strSeq + ".xml";
                outdoc.Save("C:\\" + strFileName);
            }
        }
    }

When this program runs on a copy of "AnimalBatch.xml" that has a carriage return after each element, it works and creates the Animal_xxxx.xml files as desired. When AnimalBatch.xml is the unformatted stream of text, it reads the first Animal, gets its id of 1001, and writes that output file correctly. It then reaches subsequent Animal elements but cannot read their "id" attribute, so it ends up writing output files named "Animal_.xml", apparently because the strSeq value it tries to read from the attribute is null or blank. By the end, the second file contains only this:

<?xml version="1.0" encoding="utf-8"?>
<Animal />

This leads me to believe that XmlReader (whether in the doc.Read() method, the doc.Name=="Animal" test, or the later strSeq = doc.GetAttribute("id"); call) works differently if there is a CR/LF after the <Animal id="1002"> tag.

I guess my real question is: when it calls doc.GetAttribute("id"), where is the cursor in doc? And why can't it get the ids after "1001", which does work?

John said XML does not care about formatting, and I've always thought so too, but this has me baffled. Also, for my application, the only way I can get the XML is unformatted, since I'm pulling it out of SQL via SSIS and it's a text stream, not an XML object.

FYI, don't use `new XmlTextReader`. Use `XmlReader.Create` instead. – John Saunders, Aug 27 '12 at 19:01
John - I'm going to have to give you a standing ovation. Your suggestion to use XmlReader instead of XmlTextReader was the solution. The problem with the text reader apparently had to do with how it was not recognizing the subsequent "Animal" elements (it would get the first, but when I tried to get the "ID" attribute it only ever found the first one), and I had a mess on my hands. I will post the code which now works. – Rick Bellows, Aug 27 '12 at 20:52
John - I've found that my input file does not have the "formatting" of CR/LFs shown in my sample. Does that mean I have to use XmlTextReader? I had gotten to a certain level of success using it (i.e. I could get the outer xml, I just couldn't extract the ID attribute). Maybe I need to ask this in a separate question. – Rick Bellows, Aug 28 '12 at 03:09

score 0 · Answer 1 · answered Aug 27 '12 at 07:03

First of all, I don't see you assigning anything to outdoc anywhere... I suppose you wanted to fill it with the current node's data and then save it? Also, I'd create one XmlDocument object and then clear and refill it in the loop; creating a new object in the loop a couple thousand times isn't a good idea.

Also notice that XmlReader moves one node at a time. So your code at the moment would:

Call Read() and not fall into any case (it reads the ?xml declaration first).
Call Read() once more, fall into the case, move to the id attribute, and write an empty file.
Call Read() about 10 more times, skipping everything until the next Animal element.
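
To make "one node at a time" concrete, here is a minimal sketch that reports where the reader is positioned right after ReadInnerXml() has consumed the first Animal element. The class and method names (ReaderTrace, NodeAfterFirstAnimal) are made up for illustration, and the input is passed in as a string rather than read from C:\AnimalBatch.xml.

```csharp
using System;
using System.IO;
using System.Xml;

public static class ReaderTrace
{
    // Hypothetical helper: returns "NodeType:Name" for the node the reader
    // is positioned on immediately after ReadInnerXml() has consumed the
    // first <Animal> element.
    public static string NodeAfterFirstAnimal(string xml)
    {
        using (XmlReader reader = XmlReader.Create(new StringReader(xml)))
        {
            reader.ReadToFollowing("Animal"); // cursor on the first <Animal ...> start tag
            reader.ReadInnerXml();            // reads the children and moves past </Animal>
            return reader.NodeType + ":" + reader.Name;
        }
    }
}
```

With line breaks between the elements, the reader should be sitting on a whitespace node at this point, so a following Read() lands on the next Animal start tag. Without line breaks it should already be on the next Animal start tag, so a following Read() steps past it, and the next node named "Animal" that the loop sees is the end tag, which has no attributes.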

One way to grab the data from inside the <Animal> tag is similar to this example on MSDN.

A second is to use something more convenient, such as the ReadInnerXml method together with ReadToFollowing. Also, take a look at the GetAttribute method.

My procedure would be:

string toFile = "";
Read file until <Animal> tag.
GetAttribute("id");
toFile = ReadInnerXml();
Write toFile to file ;)
doc.ReadToFollowing("Animal");

This will probably need some minor adjustments, as I haven't checked it with a compiler...
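
The procedure above could be sketched roughly as follows. This is an illustration under assumptions, not tested against the real batch file: the AnimalSplitter class, its Split method, and the outputDir parameter are invented names, and the input comes from a TextReader so the sketch does not depend on C:\AnimalBatch.xml. One adjustment to the outline: before calling ReadToFollowing, it checks whether ReadInnerXml has already left the reader on the next Animal start tag, which is what happens when there is no whitespace between elements.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class AnimalSplitter
{
    // Hypothetical helper: writes one Animal_<id>.xml per <Animal> element
    // into outputDir and returns the paths that were written.
    public static List<string> Split(TextReader input, string outputDir)
    {
        var written = new List<string>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            bool onAnimal = reader.ReadToFollowing("Animal");
            while (onAnimal)
            {
                // Read the attribute while the cursor is still on the start tag.
                string id = reader.GetAttribute("id");
                // ReadInnerXml returns the child markup and moves past </Animal>.
                string inner = reader.ReadInnerXml();

                string path = Path.Combine(outputDir, "Animal_" + id + ".xml");
                File.WriteAllText(path,
                    "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n<Animal>" + inner + "</Animal>");
                written.Add(path);

                // If the reader already sits on the next <Animal>, stay there;
                // otherwise search forward for one.
                onAnimal = (reader.NodeType == XmlNodeType.Element && reader.Name == "Animal")
                           || reader.ReadToFollowing("Animal");
            }
        }
        return written;
    }
}
```

Usage would look like AnimalSplitter.Split(File.OpenText(@"C:\AnimalBatch.xml"), @"C:\"). Writing the text directly skips building an XmlDocument per element; whether that is acceptable depends on how much validation of each fragment is wanted.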

score 0 · Answer 2 · answered Aug 27 '12 at 11:06

You need to create a root node on outdoc. Use this code:

    static void SplitXMLTextReader()
    {

        string strFileName;
        string strSeq = "0";

        XmlTextReader doc = new XmlTextReader(("C:\\AnimalBatch.xml"));
        doc.WhitespaceHandling = WhitespaceHandling.None;

        while (doc.Read())
        {
            switch (doc.Name)
            {
                case "Animal":
                    XmlDocument outdoc = new XmlDocument();
                    XmlDeclaration xmlDeclaration = outdoc.CreateXmlDeclaration("1.0", "utf-8", null);
                    XmlElement rootNode = outdoc.CreateElement(doc.Name);
                    rootNode.InnerXml = doc.ReadInnerXml();
                    outdoc.InsertBefore(xmlDeclaration, outdoc.DocumentElement);
                    outdoc.AppendChild(rootNode);


                    doc.MoveToFirstAttribute();
                    if (string.Compare(doc.Name, "id", true) == 0)
                    {
                        strSeq = doc.Value;
                    }
                    strFileName = "Animal_" + strSeq + ".xml";
                    outdoc.Save("C:\\" + strFileName);
                    break;
            }
        }

    }

This solution is close, but has a bug: I get two output xml files named Animal_0002.xml and Animal_003.xml. Animal_0002.xml is a complete output file but has the content of the first animal (one red rooster), and Animal_003.xml has just an empty tag with no payload. I think the part of the program that grabs the id (the section starting with doc.MoveToFirstAttribute()) might need to get its info from outdoc after the node has been appended to it. However, your code is very close. I see how you are creating the outdoc XmlDocuments inside the doc.Read() loop. – Rick Bellows, Aug 27 '12 at 16:47

score 0 · Answer 3 · answered Aug 27 '12 at 21:00


static void SplitXMLReader()
{
    string strFileName;
    string strSeq;

    XmlReader doc = XmlReader.Create("C:\\AnimalBatch.xml");

    while (doc.Read())
    {
        if (doc.Name=="Animal")
        {
            strSeq = doc.GetAttribute("id");

            XmlDocument outdoc = new XmlDocument();
            XmlDeclaration xmlDeclaration = outdoc.CreateXmlDeclaration("1.0", "utf-8", null);
            XmlElement rootNode = outdoc.CreateElement(doc.Name);

            rootNode.InnerXml = doc.ReadInnerXml();
            outdoc.InsertBefore(xmlDeclaration, outdoc.DocumentElement);
            outdoc.AppendChild(rootNode);

            strFileName = "Animal_" + strSeq + ".xml";
            outdoc.Save("C:\\" + strFileName);
        }
    }
}

answered Aug 27 '12 at 21:00

Rick Bellows


OMG - I've found that my "Batch" xml file is a Text stream - and does not have the CRLF's that I specified in my "AnimalBatch.xml" sample file. The above 'solution' works when there is a CRLF after the node - but can't use the XmlReader if there isn't. I'm back to working on using XmlTextReader. Pooh. – Rick Bellows Aug 28 '12 at 03:06
The answer on how to do this when there are no linefeeds in the XML input file is under another question: http://stackoverflow.com/questions/12188383/parsing-a-large-xml-file-to-multiple-output-xmls-using-xmlreader-getting-ever/12189807#12189807 – Rick Bellows Aug 30 '12 at 04:33
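
The linked answer is not reproduced here. As a separate sketch of one approach that should not depend on line breaks, the code below uses ReadToFollowing to find each Animal and ReadSubtree to load just that element into an XmlDocument. SubtreeSplitter, Split, and outputDir are illustrative names, not part of the original code, and the input is taken from a TextReader instead of a fixed path.

```csharp
using System;
using System.IO;
using System.Xml;

public static class SubtreeSplitter
{
    // Hypothetical helper: saves each <Animal> element as Animal_<id>.xml
    // in outputDir and returns how many files were written.
    public static int Split(TextReader input, string outputDir)
    {
        int count = 0;
        using (XmlReader reader = XmlReader.Create(input))
        {
            while (reader.ReadToFollowing("Animal"))
            {
                // The cursor is on the start tag, so the attribute is available.
                string id = reader.GetAttribute("id");

                var outdoc = new XmlDocument();
                // The subtree reader exposes only this element; once it is
                // closed, the outer reader is left on the matching </Animal>.
                using (XmlReader subtree = reader.ReadSubtree())
                {
                    outdoc.Load(subtree);
                }

                // The desired output files show <Animal> without the id attribute.
                outdoc.DocumentElement.RemoveAttribute("id");
                outdoc.InsertBefore(
                    outdoc.CreateXmlDeclaration("1.0", "utf-8", null),
                    outdoc.DocumentElement);

                outdoc.Save(Path.Combine(outputDir, "Animal_" + id + ".xml"));
                count++;
            }
        }
        return count;
    }
}
```

Because the outer reader ends each iteration on the end tag rather than past it, the next ReadToFollowing("Animal") call finds the following element whether or not whitespace separates the two.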
