As explained in "What is the best way to parse (big) XML in C# Code?", you can use XmlReader to stream through huge XML files with bounded memory consumption. However, XmlReader is somewhat tricky to use, as it's very easy to read too little or too much if the XML isn't exactly as expected. (Even insignificant whitespace can throw off an XmlReader algorithm.)
To reduce the chance of making such errors, first introduce the following extension methods. ReadElements() iterates through the direct child elements of the current element, and ReadDescendants() iterates through all of its descendant elements:
    using System;
    using System.Collections.Generic;
    using System.Xml;

    public static partial class XmlReaderExtensions
    {
        /// <summary>
        /// Read all immediate child elements of the current element, and yield return a reader for those matching the incoming name and namespace.
        /// Leave the reader positioned after the end of the current element.
        /// </summary>
        public static IEnumerable<XmlReader> ReadElements(this XmlReader inReader, string localName, string namespaceURI)
        {
            inReader.MoveToContent();
            if (inReader.NodeType != XmlNodeType.Element)
                throw new InvalidOperationException("The reader is not positioned on an element.");
            var isEmpty = inReader.IsEmptyElement;
            inReader.Read();
            if (isEmpty)
                yield break;
            while (!inReader.EOF)
            {
                switch (inReader.NodeType)
                {
                    case XmlNodeType.EndElement:
                        // Move the reader AFTER the end of the element
                        inReader.Read();
                        yield break;
                    case XmlNodeType.Element:
                        {
                            if (inReader.LocalName == localName && inReader.NamespaceURI == namespaceURI)
                            {
                                using (var subReader = inReader.ReadSubtree())
                                {
                                    subReader.MoveToContent();
                                    yield return subReader;
                                }
                                // ReadSubtree() leaves the reader positioned ON the end of the element, so read that also.
                                inReader.Read();
                            }
                            else
                            {
                                // Skip() leaves the reader positioned AFTER the end of the element.
                                inReader.Skip();
                            }
                        }
                        break;
                    default:
                        // Not an element: Text value, whitespace, comment. Read it and move on.
                        inReader.Read();
                        break;
                }
            }
        }

        /// <summary>
        /// Read all descendant elements of the current element, and yield return a reader for those matching the incoming name and namespace.
        /// Leave the reader positioned after the end of the current element.
        /// </summary>
        public static IEnumerable<XmlReader> ReadDescendants(this XmlReader inReader, string localName, string namespaceURI)
        {
            inReader.MoveToContent();
            if (inReader.NodeType != XmlNodeType.Element)
                throw new InvalidOperationException("The reader is not positioned on an element.");
            using (var reader = inReader.ReadSubtree())
            {
                while (reader.ReadToFollowing(localName, namespaceURI))
                {
                    // reader wraps inReader, so inReader is now positioned on the matching element.
                    using (var subReader = inReader.ReadSubtree())
                    {
                        subReader.MoveToContent();
                        yield return subReader;
                    }
                }
            }
            // Move the reader AFTER the end of the element
            inReader.Read();
        }
    }
With that, your Python algorithm can be reproduced as follows:
    // Requires using directives for System, System.Collections.Generic, System.IO,
    // System.IO.Compression, System.Linq and System.Xml.
    var zipListBox = new List<string>();
    using (var archive = ZipFile.Open(fullFileName, ZipArchiveMode.Read))
    {
        foreach (var entry in archive.Entries)
        {
            if (Path.GetExtension(entry.Name).Equals(".xml", StringComparison.OrdinalIgnoreCase))
            {
                using (var zipEntryStream = entry.Open())
                using (var reader = XmlReader.Create(zipEntryStream))
                {
                    // Move to the root element
                    reader.MoveToContent();
                    var query = reader
                        // Read all child elements <Widget>
                        .ReadElements("Widget", "")
                        // And extract the text content of their first child element <Description>
                        .SelectMany(r => r.ReadElements("Description", "").Select(i => i.ReadElementContentAsString()).Take(1));
                    zipListBox.AddRange(query);
                }
            }
        }
    }
Notes:
Your C# XPath queries do not match your original Python queries. Your original Python code does the following:
    zfft = et.parse(zff).getroot()
This unconditionally gets the root element.
    zffts = zfft.findall('Widget')
This finds all immediate child elements named "Widget" (the recursive descent operator // was not used).
    wgt.find('Description').text for wgt in zffts
This loops through the widgets and, for each, finds the first child element named "Description" and gets its text.
For comparison, xmlDoc.SelectNodes("//Root/Widget") recursively descends the entire XML element hierarchy to find nodes named <Widget> nested inside nodes named <Root>, which is probably not what you want. Similarly, tmp.SelectSingleNode("//Description") uses recursive descent to find a description node. Note that a leading // is evaluated from the document root even when it is called on a <Widget> node, so this returns the first <Description> in the whole document; .//Description would restrict the search to descendants of <Widget>. Even then, recursive descent could return a different result from the Python code if there are multiple nested <Description> nodes.
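To make the difference between direct-child lookup and recursive descent concrete, here is a minimal Python sketch using ElementTree. The sample XML, its nesting, and the variable names are invented for illustration and are not taken from your files.

```python
import xml.etree.ElementTree as et

# Hypothetical sample: one <Widget> directly under the root, one nested deeper.
XML = """<Root>
  <Widget id="1">
    <Description>top-level widget</Description>
  </Widget>
  <Group>
    <Widget id="2">
      <Extra><Description>nested description</Description></Extra>
      <Description>direct description</Description>
    </Widget>
  </Group>
</Root>"""

root = et.fromstring(XML)

# findall('Widget') matches immediate children of the root only.
direct = [w.get("id") for w in root.findall("Widget")]

# './/Widget' is recursive descent: it matches at any depth.
recursive = [w.get("id") for w in root.findall(".//Widget")]

# For widget 2, find('Description') takes the first *child* element,
# while './/Description' takes the first descendant in document order.
widget2 = root.findall(".//Widget")[1]
child_text = widget2.find("Description").text
descendant_text = widget2.find(".//Description").text

print(direct)           # ['1']
print(recursive)        # ['1', '2']
print(child_text)       # direct description
print(descendant_text)  # nested description
```

With this input the two styles of query disagree both on which widgets are found and on which description is returned for the nested widget, which is why the C# code should mirror the direct-child semantics of the Python original.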
Using XmlReader.ReadSubtree() ensures that the entire element is consumed: no more and no less.
ReadElements() works well with LINQ to XML. E.g. if you want to stream through your XML and get the id, description, and name of each widget without loading them all into memory, you could do:
    // Requires using directives for System, System.Linq and System.Xml.Linq.
    var query = reader
        .ReadElements("Widget", "")
        .Select(r => XElement.Load(r))
        .Select(e => new { Description = e.Element("Description")?.Value, Id = e.Attribute("id")?.Value, Name = e.Element("Name")?.Value });
    foreach (var widget in query)
    {
        Console.WriteLine("Id = {0}, Name = {1}, Description = {2}", widget.Id, widget.Name, widget.Description);
    }
Here again memory use will be bounded because only one XElement corresponding to a single <Widget> will be referenced at any time.
Update
"How would your code change if the collection of <Widget> tags, rather than being straight off the XML root, were in fact themselves contained in a single <Widgets> subtree of the root?"
You have a couple of options here. Firstly, you could make nested calls to ReadElements by chaining together LINQ statements that flatten the element hierarchy with SelectMany:
    var query = reader
        // Read all child elements <Widgets>
        .ReadElements("Widgets", "")
        // Read all child elements <Widget>
        .SelectMany(r => r.ReadElements("Widget", ""))
        // And extract the text content of their first child element <Description>
        .SelectMany(r => r.ReadElements("Description", "").Select(i => i.ReadElementContentAsString()).Take(1));
Use this option if you are interested in reading <Widget> nodes only at some particular XPath.
Alternatively, you could simply read all descendants named <Widget>, as shown here:
    var query = reader
        // Read all descendant elements <Widget>
        .ReadDescendants("Widget", "")
        // And extract the text content of their first child element <Description>
        .SelectMany(r => r.ReadElements("Description", "").Select(i => i.ReadElementContentAsString()).Take(1));
Use this option if you are interested in reading <Widget> nodes wherever they occur in the XML.
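In terms of the Python ElementTree code you started from, the two options correspond roughly to a fixed-path findall() and an iter() over all descendants. The sketch below assumes a hypothetical document in which <Widget> elements appear both under <Widgets> and under a made-up <Archive> element; none of these names or values come from your data.

```python
import xml.etree.ElementTree as et

# Hypothetical sample: widgets under <Widgets> and under an unrelated <Archive>.
XML = """<Root>
  <Widgets>
    <Widget><Description>first</Description></Widget>
    <Widget><Description>second</Description></Widget>
  </Widgets>
  <Archive>
    <Widget><Description>archived</Description></Widget>
  </Archive>
</Root>"""

root = et.fromstring(XML)

# Option 1: only <Widget> elements at the fixed path Root/Widgets/Widget.
at_path = [w.find("Description").text for w in root.findall("Widgets/Widget")]

# Option 2: <Widget> elements wherever they occur under the root.
anywhere = [w.find("Description").text for w in root.iter("Widget")]

print(at_path)   # ['first', 'second']
print(anywhere)  # ['first', 'second', 'archived']
```

The fixed-path form ignores the widget under <Archive>, while the descendant form picks it up, so the choice depends on whether stray <Widget> elements elsewhere in the file should count.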