Suppose I have an XML document that looks like the following:

<Node1>
    <ChildNd>
        <GrandChildNd>
            <a />
            <b />
        </GrandChildNd>
        ...
        <GrandChildNd>
            <b />
            <c />
        </GrandChildNd>
    </ChildNd>
    ...
    </ChildNd>
</Node1>
...
<NodeN>

In other words, like most XML documents, the nodes share a very similar structure, with some attributes / elements repeated within them.

And, since most of my XML files are > 200 MB, I'm writing my own parser using an XmlReader rather than the simpler models of XPath / LINQ to XML.

In writing this parser, I've found that I rely very heavily on XmlReader.ReadSubtree to make sure I always stay within the node I want, and to know that when I close the sub-reader, I'm at the end of the node I was parsing.

So, for example, to loop through all the <GrandChildNd>s in a particular <ChildNd>, I've coded something like this:

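' Requires: Imports System.Xml

' Ignore whitespace-only nodes: ReadSubtree throws InvalidOperationException
' unless the reader is positioned on an element, and the indentation between
' tags would otherwise show up as Whitespace nodes.
Dim settings As New XmlReaderSettings With {.IgnoreWhitespace = True}

Using reader As XmlReader = XmlReader.Create(uri, settings)

    reader.ReadToFollowing("Node1")
    reader.ReadToDescendant("ChildNd")
    reader.ReadStartElement("ChildNd")

    ' Loop through all the <GrandChildNd>s
    Do Until reader.NodeType = XmlNodeType.EndElement
        ' The sub-reader cannot read past the current <GrandChildNd>
        Using GrandChildNdRdr As XmlReader = reader.ReadSubtree()
            ParseGrandChild(GrandChildNdRdr)
        End Using

        ' Exit current <GrandChildNd>
        reader.ReadEndElement()
    Loop
End Using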

Even within my ParseGrandChild method, I make further ReadSubtree calls, since that guarantees I won't read anything outside the current node, and when I close the sub-reader, I'm positioned at the end tag of the node I was consuming.
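
A minimal sketch of the pattern described above, for illustration only. The module name, the inline sample XML, and the body of ParseGrandChild are assumptions and are not taken from the original parser. It swaps the ReadEndElement call for ReadToDescendant / ReadToNextSibling, which is one possible variation that also copes with empty elements such as <GrandChildNd />.

```vbnet
Imports System
Imports System.IO
Imports System.Xml

Module SubtreeSketch

    ' Hypothetical stand-in for the real ParseGrandChild: prints the names of
    ' the direct child elements of the element the sub-reader was created on.
    Sub ParseGrandChild(subReader As XmlReader)
        subReader.Read()                               ' move onto <GrandChildNd> itself
        Dim rootDepth As Integer = subReader.Depth
        While subReader.Read()                         ' returns False at the end of the subtree
            If subReader.NodeType = XmlNodeType.Element AndAlso
               subReader.Depth = rootDepth + 1 Then
                Console.WriteLine(subReader.Name)
            End If
        End While
    End Sub

    Sub Main()
        ' Assumed sample input, shaped like the XML in the question
        Dim xml As String =
            "<Node1><ChildNd>" &
            "<GrandChildNd><a /><b /></GrandChildNd>" &
            "<GrandChildNd><b /><c /></GrandChildNd>" &
            "</ChildNd></Node1>"

        Dim settings As New XmlReaderSettings With {.IgnoreWhitespace = True}

        Using reader As XmlReader = XmlReader.Create(New StringReader(xml), settings)
            reader.ReadToFollowing("ChildNd")

            If reader.ReadToDescendant("GrandChildNd") Then
                Do
                    ' The sub-reader is confined to the current <GrandChildNd>;
                    ' disposing it leaves the outer reader at the end of that element.
                    Using subReader As XmlReader = reader.ReadSubtree()
                        ParseGrandChild(subReader)
                    End Using
                Loop While reader.ReadToNextSibling("GrandChildNd")
            End If
        End Using
    End Sub

End Module
```

Because ParseGrandChild only ever sees the sub-reader, it can stop reading early or read to the end without affecting where the outer loop resumes.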

From what I've read online, the ReadSubtree method seems fairly lightweight and fine to use, but I'm wondering whether, aside from going the XPath / LINQ to XML route, there is a better way to do this, or whether I'm just doing things dead wrong.

This is still all very new to me, so any links / examples would be greatly appreciated!!

Also, I know this sample code was written in VB.NET, but I'm equally comfortable with C# / VB.NET solutions.

Thanks!!

John Bustos
  • Just out of curiosity: What is your motivation for writing your own parser? Is there anything wrong with XPath / Linq to XML? – DMAN Oct 29 '14 at 16:12
  • Maybe this SO post answers your question: http://stackoverflow.com/questions/407350/how-best-to-use-xpath-with-very-large-xml-files-in-net/716659#716659 – DMAN Oct 29 '14 at 16:21
  • @DMAN, thanks for the link, but unfortunately it doesn't answer my specific question of whether `ReadSubtree()` is good to use or not. As for why I'm writing my own parser: the files are HUGE, and even combining XmlReader with either of those 2 methods still kills me memory-wise... I've written the parser now; I'm just wondering whether the approach I used is a bad idea compared with the other methods. Thanks!! – John Bustos Oct 30 '14 at 17:44
  • Check these links about ReadSubtree(): http://msdn.microsoft.com/en-us/library/system.xml.xmlreader.readsubtree.aspx and http://stackoverflow.com/questions/2736622/problem-parsing-with-xmlreader-using-readsubtree – Rolwin Crasta Oct 31 '14 at 05:42
  • @RolwinC, thanks for those 2 links! - I definitely used it right in my parser (luckily) in that I know the nodes will be consumed and it allows me to limit the data I'm looking at. I'm just wondering if it's a good way of doing things... So far it doesn't seem anyone is saying it isn't, at least :) – John Bustos Oct 31 '14 at 14:30
  • @JohnBustos: I don't think there's a better way than using ReadSubtree since you're in .NET. ReadSubtree was created for this purpose and meets the requirements of .NET. If you're looking for performance and memory usage, you'd have to reinvent the wheel in C (or C++) without some of the extra security checks of ReadSubtree, assuming you know what you're doing... Yes, it looks like I have an old XML parser with pointers and direct access to memory/buffers; sure enough, it's fast and light, but it isn't fail-safe when used the wrong way, and now I have more issues using it than solving them. – Karl Stephen Nov 01 '14 at 00:03
  • What exactly are you trying to do with the data after reading it? You can combine XmlReader with LINQ to XML by reading a child node as an `XElement` rather than a whole document... I've found that works very well. – Jon Skeet Nov 03 '14 at 06:53
  • @JonSkeet, My main aim is to get only a very small amount of data and store it in a DB table so it can be queried by a second program. My challenge is that I need to pull basically the same keys from recursive nodes (i.e. same element names in parents and children) in a HUGE XML, so a recursive method works very well, but the `XElement` approach is terrible since the top parent I'd do this for has lots of child keys and is >100MB itself. My logic was to use `ReadSubtree` recursively to ensure I know where I am and not kill my RAM. Does that make sense? – John Bustos Nov 03 '14 at 14:18
  • @JohnBustos: Why can't you use XmlReader to get to an element which *isn't* enormous and then load an XElement from that? The requirements are still somewhat unclear to me... – Jon Skeet Nov 03 '14 at 14:30
  • @JonSkeet, I apologize for not stating it so well; the problem is the recursive nature of my tree. I need a few elements from the parent (that is, say, >100MB), then some from its child (which is, say, 60MB), then from its child, etc. The challenge would be getting the data from the very large parent nodes, since I don't know how to load a node without its children, and if I load the entire parent node, I start facing RAM issues... Especially if I have, say, 3-4 parent nodes I need to dig into. Does what I'm saying make sense? – John Bustos Nov 03 '14 at 14:46
  • Okay, yes, I see what you mean. Will have a think about it. – Jon Skeet Nov 03 '14 at 14:47
  • If you have ANY idea how to load just a node without its children, that might work (and might be even better, since then I'd get all the LINQ features), but I was thinking that `ReadSubtree` at least limits my scope to one particular node, and I can use it recursively if each child has the same structure as the parent. I just wasn't sure if that was bad practice.... – John Bustos Nov 03 '14 at 14:49
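
For reference, the combination Jon Skeet mentions in the comments (walking the file with an XmlReader and materializing only small child nodes as `XElement`) might look roughly like the sketch below. The module name and the inline sample XML are assumptions for illustration. It does not address the case raised above where the interesting parent node is itself too large to load.

```vbnet
Imports System
Imports System.IO
Imports System.Linq
Imports System.Xml
Imports System.Xml.Linq

Module StreamElementsSketch

    Sub Main()
        ' Assumed sample input, shaped like the XML in the question
        Dim xml As String =
            "<Node1><ChildNd>" &
            "<GrandChildNd><a /><b /></GrandChildNd>" &
            "<GrandChildNd><b /><c /></GrandChildNd>" &
            "</ChildNd></Node1>"

        Dim settings As New XmlReaderSettings With {.IgnoreWhitespace = True}

        Using reader As XmlReader = XmlReader.Create(New StringReader(xml), settings)
            reader.MoveToContent()

            While Not reader.EOF
                If reader.NodeType = XmlNodeType.Element AndAlso
                   reader.Name = "GrandChildNd" Then
                    ' XNode.ReadFrom loads just this element into memory and leaves
                    ' the reader on the node after it, so no extra Read() is needed.
                    Dim el As XElement = DirectCast(XNode.ReadFrom(reader), XElement)
                    Console.WriteLine(String.Join(",", el.Elements().Select(Function(e) e.Name.LocalName)))
                Else
                    reader.Read()
                End If
            End While
        End Using
    End Sub

End Module
```

Only one <GrandChildNd> is held in memory at a time, so memory use depends on the size of the largest element loaded this way rather than on the size of the file.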

0 Answers