
Suppose I have a large XML file (200–1000+ MB) and I'm just looking to get a very small subset of its data in the most efficient way.

Building on a great solution from one of my previous questions, I ended up coding an approach that uses an XmlReader combined with XmlDocument / XPath.

So, supposing I have the following XML:

<Doc>
  <Big_Element1>
      ... LOTS of sub-elements ...
  </Big_Element1>
    .....
  <Small_Element1>
    <Sub_Element1_1 />
      ...
    <Sub_Element1_N />
  </Small_Element1>

   .....

  <Small_Element2>
    <Sub_Element2_1 />
      ...
    <Sub_Element2_N />
  </Small_Element2>

   .....
  <Big_ElementN>
      .......
  </Big_ElementN>
</Doc>

All I really need is the data from the Small_Elements. The Big_Elements are definitely very large (with many small sub-elements within them), so I'd like to avoid even entering them if I don't have to.

I came up with this form of solution:

Dim doc As XmlDocument
Dim xNd As XmlNode

Using reader As XmlReader = XmlReader.Create(uri)
    reader.MoveToContent()

    While reader.Read
        If reader.NodeType = XmlNodeType.Element Then

            Select Case UCase(reader.Name)

                Case "SMALL_ELEMENT1"
                    doc = New XmlDocument
                    xNd = doc.ReadNode(reader)
                    GetSmallElement1Data(xNd)

                Case "SMALL_ELEMENT2"
                    doc = New XmlDocument
                    xNd = doc.ReadNode(reader)
                    GetSmallElement2Data(xNd)
            End Select
        End If
    End While
End Using

GetSmallElement1Data(xNd) and GetSmallElement2Data(xNd) are easy enough for me to deal with since the nodes they receive are small, so I use XPath within them to get the data I need.

But my question is that it seems this reader still goes through the entire XML rather than just skipping over the Big_Elements. Or is this not the correct way to have programmed this?

Also, I know this sample code is written in VB.NET, but I'm equally comfortable with C# or VB.NET solutions.

Any help / thoughts would be great!!!

Thanks!!!

John Bustos

1 Answer


Suppose I have a large XML (200 - 1000+ MB)

XmlReader is the only approach that does not parse the whole document to create an in-memory object model.

But my question is that it seems this reader still goes through the entire XML rather than just skipping over the Big_Elements. Or is this not the correct way to have programmed this?

The parser still has to read that content: it has no knowledge of which elements you are interested in.

Your only option to skip content (and thus not return to your code from XmlReader.Read for every node) is to call XmlReader.Skip, which tells the parser you are not interested in any descendants of the current node. The parser will still need to read and parse the text to find the matching end tag, but because control does not return to your code in between, this will be quicker.
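A sketch of how Skip could be folded into the loop from the question (element names and the Get...Data helpers are taken from the question; the restructured loop shape is an assumption, not the asker's original code). The key point is that ReadNode and Skip both advance the reader past the current element, so the loop must not call Read again on top of that:

```vb
Dim doc As XmlDocument
Dim xNd As XmlNode

Using reader As XmlReader = XmlReader.Create(uri)
    reader.MoveToContent()      ' position on the <Doc> root element
    reader.ReadStartElement()   ' step inside <Doc>

    ' MoveToContent skips whitespace/comments and returns the node type,
    ' so the loop runs while the reader sits on a child element of <Doc>.
    While reader.MoveToContent() = XmlNodeType.Element

        Select Case UCase(reader.Name)

            Case "SMALL_ELEMENT1"
                doc = New XmlDocument
                xNd = doc.ReadNode(reader)   ' consumes the element and moves past it
                GetSmallElement1Data(xNd)

            Case "SMALL_ELEMENT2"
                doc = New XmlDocument
                xNd = doc.ReadNode(reader)
                GetSmallElement2Data(xNd)

            Case Else
                ' Big_Elements (and anything else): jump past the whole subtree
                ' without control returning to this loop for each descendant.
                reader.Skip()
        End Select
    End While
End Using
```

When the reader reaches the closing </Doc> tag, MoveToContent returns XmlNodeType.EndElement and the loop exits. Note that this only avoids per-node callbacks into your code; as explained above, the parser itself must still scan the bytes of the skipped subtrees.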

Richard
  • Thank you, @Richard. So, my question is, how would I update my current code to incorporate that so as not to step into those bigger nodes? – John Bustos Oct 15 '14 at 16:31
  • Would I somehow incorporate a `Case Else Reader.Skip` kind of logic? I'm just confused how that would look... – John Bustos Oct 15 '14 at 16:35
  • @JohnBustos I would start by getting the simplest case (extracting one element) working with smaller test documents. Only then, once the core is as efficient as possible, directly address skipping as much as possible. – Richard Oct 15 '14 at 20:00