I'm trying to translate into C# a piece of Python code that takes a ZIP file full of XML files, then for each XML file performs a specific XPath query and returns the result. In Python it's pretty lightweight and looks like this (I realise the example below is not strictly XPath but I wrote it a while ago!):

import zipfile
import xml.etree.ElementTree as et  # assumed: "et" is the standard-library ElementTree

with zipfile.ZipFile(fullFileName) as zf:
    zfxml = [f for f in zf.namelist() if f.endswith('.xml')]
    for zfxmli in zfxml:
        with zf.open(zfxmli) as zff:
            zfft = et.parse(zff).getroot()
            zffts = zfft.findall('Widget')
            print([wgt.find('Description').text for wgt in zffts])

The closest I've managed to get in C# has been:

foreach (ZipArchiveEntry entry in archive.Entries)
{
    FileInfo fi = new FileInfo(entry.FullName);

    if (fi.Extension.Equals(".xml", StringComparison.OrdinalIgnoreCase))
    {
        using (Stream zipEntryStream = entry.Open())
        {
            XmlDocument xmlDoc = new XmlDocument();

            xmlDoc.Load(zipEntryStream);
            XmlNodeList wgtNodes = xmlDoc.SelectNodes("//Root/Widget");

            foreach (XmlNode tmp in wgtNodes)
            {
                zipListBox.Items.Add(tmp.SelectSingleNode("//Description"));
            }
        }
    }
}

Although this does work for smaller ZIP files, it takes up way more memory than the Python implementation, and crashes out of memory if the ZIP file has too many XML files in it. Is there another, more efficient, way of achieving this?

dbc
  • Use `XmlReader` to read large xml files. – Cinchoo Nov 04 '19 at 14:18
  • As mentioned above, `XmlReader` is what to use. See e.g. [What is the best way to parse (big) XML in C# Code?](https://stackoverflow.com/q/676274/3744182), [XmlReader - how to deal with large XML-Files?](https://stackoverflow.com/q/24045243/3744182), [Read Mulitple childs and extract data xmlReader in c#](https://stackoverflow.com/q/38425140/3744182). However, the `XmlReader` API is kind of hard to use, so will you need specific help after reading those previous answers? – dbc Nov 04 '19 at 17:56

1 Answer

As explained in [What is the best way to parse (big) XML in C# Code?](https://stackoverflow.com/q/676274/3744182), you can use XmlReader to stream through huge XML files with bounded memory consumption. However, XmlReader is somewhat tricky to use, as it's very easy to read too little or too much if the XML isn't exactly as expected. (Even insignificant whitespace can throw off an XmlReader algorithm.)

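To reduce the chance of making such errors, first introduce the following extension methods: ReadElements(), which iterates through all direct child elements of the current element, and ReadDescendants(), which iterates through matching elements at any depth below it: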

using System;
using System.Collections.Generic;
using System.Xml;

public static partial class XmlReaderExtensions
{
    /// <summary>
    /// Read all immediate child elements of the current element, and yield return a reader for those matching the incoming name and namespace.
    /// Leave the reader positioned after the end of the current element
    /// </summary>
    public static IEnumerable<XmlReader> ReadElements(this XmlReader inReader, string localName, string namespaceURI)
    {
        inReader.MoveToContent();
        if (inReader.NodeType != XmlNodeType.Element)
            throw new InvalidOperationException("The reader is not positioned on an element.");
        var isEmpty = inReader.IsEmptyElement;
        inReader.Read();
        if (isEmpty)
            yield break;
        while (!inReader.EOF)
        {
            switch (inReader.NodeType)
            {
                case XmlNodeType.EndElement:
                    // Move the reader AFTER the end of the element
                    inReader.Read();
                    yield break;
                case XmlNodeType.Element:
                    {
                        if (inReader.LocalName == localName && inReader.NamespaceURI == namespaceURI)
                        {
                            using (var subReader = inReader.ReadSubtree())
                            {
                                subReader.MoveToContent();
                                yield return subReader;
                            }
                            // ReadSubtree() leaves the reader positioned ON the end of the element, so read that also.
                            inReader.Read();
                        }
                        else
                        {
                            // Skip() leaves the reader positioned AFTER the end of the element.
                            inReader.Skip();
                        }
                    }
                    break;
                default:
                    // Not an element: Text value, whitespace, comment.  Read it and move on.
                    inReader.Read();
                    break;
            }
        }
    }

    /// <summary>
    /// Read all descendant elements of the current element, and yield return a reader for those matching the incoming name and namespace.
    /// Leave the reader positioned after the end of the current element
    /// </summary>
    public static IEnumerable<XmlReader> ReadDescendants(this XmlReader inReader, string localName, string namespaceURI)
    {
        inReader.MoveToContent();
        if (inReader.NodeType != XmlNodeType.Element)
            throw new InvalidOperationException("The reader is not positioned on an element.");
        using (var reader = inReader.ReadSubtree())
        {
            while (reader.ReadToFollowing(localName, namespaceURI))
            {
                using (var subReader = inReader.ReadSubtree())
                {
                    subReader.MoveToContent();
                    yield return subReader;
                }
            }
        }
        // Move the reader AFTER the end of the element
        inReader.Read();
    }
}

With that, your Python algorithm can be reproduced as follows:

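// Namespaces used below: System, System.Collections.Generic, System.IO, System.IO.Compression, System.Linq, System.Xml
var zipListBox = new List<string>();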

using (var archive = ZipFile.Open(fullFileName, ZipArchiveMode.Read))
{
    foreach (var entry in archive.Entries)
    {
        if (Path.GetExtension(entry.Name).Equals(".xml", StringComparison.OrdinalIgnoreCase))
        {
            using (var zipEntryStream = entry.Open())
            using (var reader = XmlReader.Create(zipEntryStream))
            {
                // Move to the root element
                reader.MoveToContent();

                var query = reader
                    // Read all child elements <Widget>
                    .ReadElements("Widget", "")
                    // And extract the text content of their first child element <Description>
                    .SelectMany(r => r.ReadElements("Description", "").Select(i => i.ReadElementContentAsString()).Take(1));

                zipListBox.AddRange(query);
            }
        }
    }
}

Notes:

  • Your C# XPath queries do not match your original Python queries. Your original Python code does the following:

    zfft = et.parse(zff).getroot()
    

    This unconditionally gets the root element.

    zffts = zfft.findall('Widget')
    

    This finds all immediate child elements named "Widget" (the recursive descent operator // was not used).

    wgt.find('Description').text for wgt in zffts
    

    This loops through the widgets and, for each, finds the first child element named "Description" and gets its text.

    For comparison, xmlDoc.SelectNodes("//Root/Widget") recursively descends the entire XML element hierarchy to find nodes named <Widget> that are children of nodes named <Root>, wherever they occur -- which is probably not what you want. Similarly, because an XPath expression that begins with // is evaluated from the document root rather than from the context node, tmp.SelectSingleNode("//Description") searches the whole document and returns the first <Description> it finds, regardless of which <Widget> it was called on. A relative path such as tmp.SelectSingleNode("Description") would select the widget's own child element instead.

  • Using XmlReader.ReadSubtree() ensures that the entire element is consumed -- no more and no less.

  • ReadElements() works well with LINQ to XML. E.g. if you want to stream through your XML and get the id, description, and name of each widget without loading them all into memory, you could do:

    var query = reader
        .ReadElements("Widget", "")
        .Select(r => XElement.Load(r))
        .Select(e => new { Description = e.Element("Description")?.Value, Id = e.Attribute("id")?.Value, Name = e.Element("Name")?.Value });
    
    foreach (var widget in query)
    {
        Console.WriteLine("Id = {0}, Name = {1}, Description = {2}", widget.Id, widget.Name, widget.Description);
    }
    

    Here again memory use will be bounded because only one XElement corresponding to a single <Widget> will be referenced at any time.
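
To make the first note concrete on the Python side, here is a minimal sketch of how direct-child matching differs from recursive matching in `xml.etree.ElementTree`. The sample document, including the `<Group>` wrapper and the `id` values, is made up for illustration and is not taken from the original XML:

```python
import xml.etree.ElementTree as et

# Hypothetical sample: one <Widget> directly under the root and one nested
# more deeply, each with its own <Description>.
SAMPLE = """\
<Root>
  <Widget id="1">
    <Description>top-level widget</Description>
  </Widget>
  <Group>
    <Widget id="2">
      <Description>nested widget</Description>
    </Widget>
  </Group>
</Root>"""

root = et.fromstring(SAMPLE)

# A bare tag name matches direct children only, as in the question's code.
direct = [w.find('Description').text for w in root.findall('Widget')]

# The './/' prefix searches the whole subtree below the root instead.
recursive = [w.find('Description').text for w in root.findall('.//Widget')]

print(direct)     # ['top-level widget']
print(recursive)  # ['top-level widget', 'nested widget']
```

The first list is what the original Python produces, so it is the result a faithful C# translation should be checked against.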

Demo fiddle here.

Update

How would your code change if the collection of <Widget> tags, rather than being straight off the XML root, were in fact themselves contained in a single <Widgets> subtree of the root?

You have a couple of options here. Firstly, you could make nested calls to ReadElements by chaining together LINQ statements that flatten the element hierarchy with SelectMany:

var query = reader
    // Read all child elements <Widgets>
    .ReadElements("Widgets", "")
    // Read all child elements <Widget>
    .SelectMany(r => r.ReadElements("Widget", ""))
    // And extract the text content of their first child element <Description>
    .SelectMany(r => r.ReadElements("Description", "").Select(i => i.ReadElementContentAsString()).Take(1));

Use this option if you are interested in reading <Widget> nodes only at some particular XPath.

Alternatively, you could simply read to all descendants named <Widget> as shown here:

var query = reader
    // Read all descendant elements <Widget>
    .ReadDescendants("Widget", "")
    // And extract the text content of their first child element <Description>
    .SelectMany(r => r.ReadElements("Description", "").Select(i => i.ReadElementContentAsString()).Take(1));

Use this option if you are interested in reading <Widget> nodes wherever they occur in the XML.
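
For anyone checking the C# results against the original Python, the same two choices can be sketched in a streaming style with `xml.etree.ElementTree.iterparse`. This is only a rough analogue and not a translation of the C# above. The helper name `widget_descriptions`, its `required_path` parameter, and the sample XML (including the `<Archive>` element) are assumptions made for illustration:

```python
import io
import xml.etree.ElementTree as et

# Hypothetical sample: two widgets under <Widgets> and one stray widget elsewhere.
SAMPLE = b"""<Root>
  <Widgets>
    <Widget><Description>first</Description></Widget>
    <Widget><Description>second</Description></Widget>
  </Widgets>
  <Archive>
    <Widget><Description>stray</Description></Widget>
  </Archive>
</Root>"""

def widget_descriptions(stream, required_path=None):
    """Yield the text of the first <Description> child of each <Widget>.

    If required_path is given (tag names from the root down to <Widget>),
    only widgets found at exactly that path are reported.
    """
    path = []  # tags of the currently open elements, root first
    for event, elem in et.iterparse(stream, events=('start', 'end')):
        if event == 'start':
            path.append(elem.tag)
            continue
        # 'end' event: the element is complete and is still the last path entry.
        if elem.tag == 'Widget':
            if required_path is None or path == required_path:
                yield elem.findtext('Description')
            # Release the widget's content; the emptied element itself
            # stays attached to its parent.
            elem.clear()
        path.pop()

# Option 1: only widgets at Root/Widgets/Widget.
at_path = list(widget_descriptions(io.BytesIO(SAMPLE), ['Root', 'Widgets', 'Widget']))
# Option 2: widgets wherever they occur.
anywhere = list(widget_descriptions(io.BytesIO(SAMPLE)))

print(at_path)   # ['first', 'second']
print(anywhere)  # ['first', 'second', 'stray']
```

Running both variants over the same input is a quick way to see whether the stricter path-based option drops any widgets you actually need.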

Demo fiddle #2 here.

dbc
  • Thanks for putting such effort and detail into your reply! There's a few aspects in there I wasn't aware of, such as the implementation of SelectNodes. There's a lot I've got to try there so I'll go give it a bash and let you know how I get on. – Adam Gripton Nov 06 '19 at 11:05
  • I have a follow-up question: how would your code change if the collection of `<Widget>` tags, rather than being straight off the XML root, were in fact themselves contained in a single `<Widgets>` subtree of the root? – Adam Gripton Nov 06 '19 at 15:27
  • @AdamGripton - please edit your original question to share a sample of XML -- i.e. a [mcve]. Does the python shown in your question work with that XML? – dbc Nov 06 '19 at 16:40
  • Also, is it sufficient to read any `<Widget>` tag anywhere in the XML, or do you need to restrict that to reading `<Widget>` tags only at some specified XPath? – dbc Nov 06 '19 at 17:00
  • @AdamGripton - answer updated. But in the future, please take note that the preferred format for questions on Stack Overflow is [one question per post](https://meta.stackexchange.com/q/222735), so if you have some followup question it may be preferable to accept the current answer, and ask another question. – dbc Nov 06 '19 at 19:10
  • No probs. Really appreciate your help with this one - have used your approach in my code and things seem to be working! Thanks for the advice, I didn't like writing the follow-up myself - think I underestimated how badly behaved some parts of my input were! – Adam Gripton Nov 06 '19 at 20:08