I need to take an XML file and create multiple output xml files from what could be thousands of repeating nodes of the input file. The source file "AnimalBatch.xml" looks like this:
<?xml version="1.0" encoding="utf-8" ?>
<Animals>
<Animal id="1001">
<Quantity>One</Quantity>
<Adjective>Red</Adjective>
<Name>Rooster</Name>
</Animal>
<Animal id="1002">
<Quantity>Two</Quantity>
<Adjective>Stubborn</Adjective>
<Name>Donkeys</Name>
</Animal>
<Animal id="1003">
<Quantity>Three</Quantity>
<Color>Blind</Color>
<Name>Mice</Name>
</Animal>
</Animals>
But in actuality, there are no CR/LF characters in it. The actual stream of text looks like this:
<?xml version="1.0" encoding="utf-8" ?><Animals><Animal id="1001"><Quantity>One</Quantity><Adjective>Red</Adjective><Name>Rooster</Name></Animal><Animal id="1002"><Quantity>Two</Quantity><Adjective>Stubborn</Adjective><Name>Donkeys</Name></Animal><Animal id="1003"><Quantity>Three</Quantity><Color>Blind</Color><Name>Mice</Name></Animal></Animals>
The program needs to split the repeating "Animal" elements and produce three files named Animal_1001.xml, Animal_1002.xml, and Animal_1003.xml.
I had a prior question on this using XmlDocument, which was already answered.
See: [Splitting XML file into multiple xml using XmlDocument][1]
This question is about how to use XmlReader to grab the elements and create XmlDocument elements from them.
The desired output files are:
Animal_1001.xml:
<?xml version="1.0" encoding="utf-8"?>
<Animal>
<Quantity>One</Quantity>
<Adjective>Red</Adjective>
<Name>Rooster</Name>
</Animal>
Animal_1002.xml:
<?xml version="1.0" encoding="utf-8"?>
<Animal>
<Quantity>Two</Quantity>
<Adjective>Stubborn</Adjective>
<Name>Donkeys</Name>
</Animal>
Animal_1003.xml:
<?xml version="1.0" encoding="utf-8"?>
<Animal>
<Quantity>Three</Quantity>
<Color>Blind</Color>
<Name>Mice</Name>
</Animal>
Here is the code. It works, but only when there are line breaks in the input file:
static void SplitXMLReader()
{
    string strFileName;
    string strSeq;
    XmlReader doc = XmlReader.Create("C:\\AnimalBatch.xml");
    while (doc.Read())
    {
        if (doc.Name=="Animal")
        {
            strSeq = doc.GetAttribute("id");
            XmlDocument outdoc = new XmlDocument();
            XmlDeclaration xmlDeclaration = outdoc.CreateXmlDeclaration("1.0", "utf-8", null);
            XmlElement rootNode = outdoc.CreateElement(doc.Name);
            rootNode.InnerXml = doc.ReadInnerXml();
            outdoc.InsertBefore(xmlDeclaration, outdoc.DocumentElement);
            outdoc.AppendChild(rootNode);
            strFileName = "Animal_" + strSeq + ".xml";
            outdoc.Save("C:\\" + strFileName);
        }
    }
}
When this program is run on a copy of "AnimalBatch.xml" that has a line break after each element, it works and creates the Animal_xxxx.xml files as desired. When AnimalBatch.xml is the unformatted stream of text, it reads the first Animal, gets its id of 1001, and writes that output file correctly. It does match subsequent Animal nodes but cannot read their "id" attribute, so it ends up writing output files named "Animal_.xml", apparently because the strSeq value it reads from the attribute is null or blank. By the end, the second file contains only this:
<?xml version="1.0" encoding="utf-8"?>
<Animal />
This leads me to believe that the XmlReader, at least in the doc.Read() call, the (doc.Name=="Animal") test, or the later strSeq = doc.GetAttribute("id"); statement, works differently if there is a CR/LF after the <Animal id="1002"> tag.
I guess my real question is: when it calls doc.GetAttribute("id"), where is the cursor in doc? And why can't it get the ids after "1001", which does work?
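One way to probe where the cursor sits is to print the reader's NodeType and Name each time the "Animal" test matches, and again right after ReadInnerXml() returns. The following is only a minimal diagnostic sketch: the class name ReaderPositionProbe and the shortened two-animal inline string are assumptions made for illustration, not part of the real program.

```csharp
using System;
using System.IO;
using System.Xml;

class ReaderPositionProbe
{
    static void Main()
    {
        // Assumption: a shortened, unformatted stand-in for AnimalBatch.xml.
        string xml =
            "<Animals><Animal id=\"1001\"><Name>Rooster</Name></Animal>" +
            "<Animal id=\"1002\"><Name>Donkeys</Name></Animal></Animals>";

        using (XmlReader doc = XmlReader.Create(new StringReader(xml)))
        {
            while (doc.Read())
            {
                // Same test as the original loop: matches on the name only,
                // so it does not distinguish start tags from end tags.
                if (doc.Name == "Animal")
                {
                    Console.WriteLine("match: {0} <{1}> id={2}",
                        doc.NodeType, doc.Name, doc.GetAttribute("id") ?? "(null)");

                    // ReadInnerXml() itself advances the reader.
                    doc.ReadInnerXml();

                    // Where the reader is before the loop calls Read() again.
                    Console.WriteLine("  after ReadInnerXml: {0} <{1}>",
                        doc.NodeType, doc.Name);
                }
            }
        }
    }
}
```

If the "after ReadInnerXml" line already reports the next <Animal> start tag, then the doc.Read() at the top of the loop would step past that tag before its attribute can be read, and the next match on the name "Animal" would be an EndElement node, which has no attributes. Running the same probe on a line-broken copy of the string should show a Whitespace node at that point instead.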
John said XML does not care about formatting, and I've always thought so too, but this has me baffled. Also, for my application, the only way I can get the XML is unformatted, since I'm pulling it out of SQL via SSIS and it arrives as a text stream, not an XML object.
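If the cause turns out to be the reader advancing twice per Animal (once inside ReadInnerXml() and once in the loop's Read()), a variant worth trying is to call Read() only on iterations where nothing else has moved the reader, and to match start tags only. This is a sketch under that assumption, not a confirmed fix; the SplitAnimals helper, the temp output directory, and the inline test string are all hypothetical.

```csharp
using System;
using System.IO;
using System.Xml;

class SplitSketch
{
    // Hypothetical helper: writes one Animal_<id>.xml per <Animal> element.
    static void SplitAnimals(TextReader input, string outputDir)
    {
        using (XmlReader doc = XmlReader.Create(input))
        {
            doc.Read();
            while (!doc.EOF)
            {
                // Only start tags qualify; an </Animal> end tag is skipped.
                if (doc.NodeType == XmlNodeType.Element && doc.Name == "Animal")
                {
                    string strSeq = doc.GetAttribute("id");

                    XmlDocument outdoc = new XmlDocument();
                    outdoc.AppendChild(outdoc.CreateXmlDeclaration("1.0", "utf-8", null));
                    XmlElement rootNode = outdoc.CreateElement(doc.Name);

                    // ReadInnerXml() leaves the reader on the node after </Animal>,
                    // so this branch deliberately does not call Read() afterwards.
                    rootNode.InnerXml = doc.ReadInnerXml();
                    outdoc.AppendChild(rootNode);

                    outdoc.Save(Path.Combine(outputDir, "Animal_" + strSeq + ".xml"));
                }
                else
                {
                    // Nothing consumed on this iteration, so advance by one node.
                    doc.Read();
                }
            }
        }
    }

    static void Main()
    {
        // Assumption: a shortened, unformatted stand-in for AnimalBatch.xml.
        string xml =
            "<Animals><Animal id=\"1001\"><Name>Rooster</Name></Animal>" +
            "<Animal id=\"1002\"><Name>Donkeys</Name></Animal></Animals>";

        string dir = Path.Combine(Path.GetTempPath(), "AnimalSplit");
        Directory.CreateDirectory(dir);

        SplitAnimals(new StringReader(xml), dir);

        foreach (string file in Directory.GetFiles(dir, "Animal_*.xml"))
        {
            Console.WriteLine(Path.GetFileName(file));
        }
    }
}
```

Because the loop never depends on a whitespace node sitting between consecutive elements, it should behave the same on the formatted and unformatted versions of the input.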