I'm trying to parse a very big XML file in C# - big enough that some XML tools won't handle it, so I want to handle it sequentially rather than loading it all in. Also, if there are certain errors in the source I want to be able to report the error along with the line number in the XML on which it occurred.

Unfortunately, the XML repeats element names at different levels, something like:

<foo>
    <foo>
        <foo>Something interesting</foo>
    </foo>
    Something else interesting
    <foo>Yes, it's horrid, isn't it?</foo>
</foo>

And I need to keep track of the nesting level at which things occur.

I've tried using XmlTextReader, but I seem to just get a list of foo elements: I can't work out how to track the nesting level. My next thought was to call ReadSubtree on each element, so that I would know when I had returned from a nested level. But that returns an XmlReader, not an XmlTextReader, so I no longer have access to the line number in the original XML. A web search suggests using ReadOuterXml to get the text of the node and creating another reader from that, but that appears to read in the entire text, so I'm back to my original problem of the file being too big.

So how can I keep track of nesting level (when the element names don't help) and source line number without loading the whole file in?

digitig
  • I would read one tag at a time that doesn't repeat. You can then parse all the children of the tag using other methods. – jdweng Dec 01 '15 at 00:53

1 Answer

Answering your related questions:

  1. You can cast your XmlReader to IXmlLineInfo to get line numbers. Note that not every XmlReader implementation implements this interface, but the one returned by XmlReader.Create(string inputUri) does. So does the older XmlTextReader, which Microsoft recommends replacing with a reader from XmlReader.Create.

  2. To get the current depth, use XmlReader.Depth.

  3. More generally, you could maintain a stack of XName objects as you iterate through the file, pushing on each start tag and popping on each end tag (or immediately for an empty element), for instance with:

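    using System;
    using System.Collections.Generic;
    using System.Xml;
    using System.Xml.Linq; // for XName

    public static class XmlReaderExtensions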
    {
        public static void WalkXmlNodes(this XmlReader xmlReader, Action<XmlReader, Stack<XName>, IXmlLineInfo> action)
        {
            IXmlLineInfo xmlInfo = xmlReader as IXmlLineInfo;
            try
            {
                Stack<XName> names = new Stack<XName>();
    
                while (xmlReader.Read())
                {
                    if (xmlReader.NodeType == XmlNodeType.Element)
                    {
                        names.Push(XName.Get(xmlReader.LocalName, xmlReader.NamespaceURI));
                    }
    
                    action(xmlReader, names, xmlInfo);
    
                    if ((xmlReader.NodeType == XmlNodeType.Element && xmlReader.IsEmptyElement)
                        || xmlReader.NodeType == XmlNodeType.EndElement)
                    {
                        names.Pop();
                    }
                }
            }
            catch (Exception ex)
            {
                // Rethrow exception with line number information.
                var line = (xmlInfo == null ? -1 : xmlInfo.LineNumber);
                var pos = (xmlInfo == null ? -1 : xmlInfo.LinePosition);
                var xmlException = new XmlException("XmlException occurred", ex, line, pos);
                throw xmlException;
            }
        }
    }
    
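For points 1 and 2 on their own, here is a minimal sketch that reads the document sequentially and records the depth and source line of every start tag. The class name `DepthAndLineDemo`, the method `DescribeElements`, and the output format are made up for illustration; only `XmlReader.Create`, `XmlReader.Depth`, and `IXmlLineInfo` come from the framework.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class DepthAndLineDemo
{
    // Hypothetical helper: returns one "depth/line/name" entry per start tag,
    // reading the input node by node instead of loading it all into memory.
    public static List<string> DescribeElements(TextReader input)
    {
        var results = new List<string>();
        using (XmlReader reader = XmlReader.Create(input))
        {
            // The cast can yield null, and HasLineInfo() can return false,
            // for reader implementations that do not track positions.
            IXmlLineInfo lineInfo = reader as IXmlLineInfo;
            bool hasLineInfo = lineInfo != null && lineInfo.HasLineInfo();

            while (reader.Read())
            {
                if (reader.NodeType != XmlNodeType.Element)
                    continue;

                // Depth is the nesting level of the current node; -1 marks "no line info".
                int line = hasLineInfo ? lineInfo.LineNumber : -1;
                results.Add(string.Format("depth={0} line={1} name={2}",
                                          reader.Depth, line, reader.Name));
            }
        }
        return results;
    }
}
```

Calling `DepthAndLineDemo.DescribeElements(File.OpenText(path))` on XML shaped like the question's sample should report the outermost `foo` at depth 0 and each nested `foo` one level deeper, even though the element names are identical.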
dbc
  • Thanks - it was the IXmlLineInfo interface that I needed (and didn't previously know about). I'd also missed XmlTextReader being obsolete, so I've replaced that. For what it's worth, I'm handling the nesting by calling `ReadSubtree` on the reader and passing the result to an appropriate method (I know what the different nesting levels mean, even if the XML gives me no clue!). When the method returns I know I've popped back up a level. – digitig Dec 02 '15 at 16:13