3

I have a large XML file that needs parsed in VBA (excel 2003 & 2007). There could be upwards of 11,000 'rows' of data in the xml file with each 'row' having between 10 and 20 'columns'. This ends up being a huge task just to parse through and grab the data (5 - 7 minutes). I tried reading the xml and placing each 'row' into a dictionary (key = row number, value = Row Attributes), but this takes just as long.

It is taking forever to traverse the DOM. Is there a more efficient way?

Dim XMLDict
    Sub ParseXML(ByRef RootNode As IXMLDOMNode)
        Dim Counter As Long
        Dim RowList As IXMLDOMNodeList
        Dim ColumnList As IXMLDOMNodeList
        Dim RowNode As IXMLDOMNode
        Dim ColumnNode As IXMLDOMNode
        Counter = 1
        Set RowList = RootNode.SelectNodes("Row")

        For Each RowNode In RowList
            Set ColumnList = RowNode.SelectNodes("Col")
            Dim NodeValues As String
            For Each ColumnNode In ColumnList
                NodeValues = NodeValues & "|" & ColumnNode.Attributes.getNamedItem("id").Text & ":" & ColumnNode.Text
            Next ColumnNode
            XMLDICT.Add Counter, NodeValues
            Counter = Counter + 1
        Next RowNode
    End Sub
Mark Mooibroek
  • 7,636
  • 3
  • 32
  • 53
Doug S.
  • 682
  • 1
  • 10
  • 26
  • What is it that you are trying to achieve? Do you need to iterate over every row and column, or is there a sub-set of data that you actually need. It can be incredibly slow to process if you use SQL to "select *" from your database into a recordset and iterate over every row and column in your table, but that isn't normally how you efficiently process data from a database. Similarly, you can use XPath to select the subset of the XML you need to process rather than iterating over the entire document. – Mads Hansen Apr 12 '11 at 01:38

2 Answers2

7

You could try using SAX instead of DOM. SAX should be faster when all you are doing is parsing the document and the document is non-trivial in size. The reference for the SAX2 implementation in MSXML is here

I typically reach straight for the DOM for most XML parsing in Excel but SAX seems to have advantages in some situations. The short comparison here might help to explain the differences between them.

Here's a hacked-together example (partially based on this) just using Debug.Print for output:

Add a reference to "Microsoft XML, v6.0" via Tools > References

Add this code in a normal module

Option Explicit

Sub main()

Dim saxReader As SAXXMLReader60
Dim saxhandler As ContentHandlerImpl

Set saxReader = New SAXXMLReader60
Set saxhandler = New ContentHandlerImpl

Set saxReader.contentHandler = saxhandler
saxReader.parseURL "file://C:\Users\foo\Desktop\bar.xml"

Set saxReader = Nothing

End Sub

Add a class module, call it ContentHandlerImpl and add the following code

Option Explicit

Implements IVBSAXContentHandler

Private lCounter As Long
Private sNodeValues As String
Private bGetChars As Boolean

Use the left-hand drop-down at the top of the module to choose "IVBSAXContentHandler" and then use the right-hand drop-down to add stubs for each event in turn (from characters to startPrefixMapping)

Add code to some of the stubs as follows

Explicitly set up the counter and the flag to show if we want to read text data at this time

Private Sub IVBSAXContentHandler_startDocument()

lCounter = 0
bGetChars = False

End Sub

Every time a new element starts, check the name of the element and take appropriate action

Private Sub IVBSAXContentHandler_startElement(strNamespaceURI As String, strLocalName As String, strQName As String, ByVal oAttributes As MSXML2.IVBSAXAttributes)

Select Case strLocalName
    Case "Row"
        sNodeValues = ""
    Case "Col"
        sNodeValues = sNodeValues & "|" & oAttributes.getValueFromName(strNamespaceURI, "id") & ":"
        bGetChars = True
    Case Else
        ' do nothing
End Select

End Sub

Check to see if we are interested in the text data and, if we are, chop off any extraneous white space and remove all line feeds (this may or may not be desirable depending on the document you are trying to parse)

Private Sub IVBSAXContentHandler_characters(strChars As String)

If (bGetChars) Then
    sNodeValues = sNodeValues & Replace(Trim$(strChars), vbLf, "")
End If

End Sub

If we have reached the end of a Col then stop reading the text values; if we have reached the end of a Row then print out the string of node values

Private Sub IVBSAXContentHandler_endElement(strNamespaceURI As String, strLocalName As String, strQName As String)

Select Case strLocalName
    Case "Col"
        bGetChars = False
    Case "Row"
        lCounter = lCounter + 1
        Debug.Print lCounter & " " & sNodeValues
    Case Else
        ' do nothing
End Select

End Sub

To make things clearer, here is the full version of ContentHandlerImpl with al of the stub methods in place:

Option Explicit

Implements IVBSAXContentHandler

Private lCounter As Long
Private sNodeValues As String
Private bGetChars As Boolean

Private Sub IVBSAXContentHandler_characters(strChars As String)

If (bGetChars) Then
    sNodeValues = sNodeValues & Replace(Trim$(strChars), vbLf, "")
End If

End Sub

Private Property Set IVBSAXContentHandler_documentLocator(ByVal RHS As MSXML2.IVBSAXLocator)

End Property

Private Sub IVBSAXContentHandler_endDocument()

End Sub

Private Sub IVBSAXContentHandler_endElement(strNamespaceURI As String, strLocalName As String, strQName As String)

Select Case strLocalName
    Case "Col"
        bGetChars = False
    Case "Row"
        lCounter = lCounter + 1
        Debug.Print lCounter & " " & sNodeValues
    Case Else
        ' do nothing
End Select

End Sub

Private Sub IVBSAXContentHandler_endPrefixMapping(strPrefix As String)

End Sub

Private Sub IVBSAXContentHandler_ignorableWhitespace(strChars As String)

End Sub

Private Sub IVBSAXContentHandler_processingInstruction(strTarget As String, strData As String)

End Sub

Private Sub IVBSAXContentHandler_skippedEntity(strName As String)

End Sub

Private Sub IVBSAXContentHandler_startDocument()

lCounter = 0
bGetChars = False

End Sub

Private Sub IVBSAXContentHandler_startElement(strNamespaceURI As String, strLocalName As String, strQName As String, ByVal oAttributes As MSXML2.IVBSAXAttributes)

Select Case strLocalName
    Case "Row"
        sNodeValues = ""
    Case "Col"
        sNodeValues = sNodeValues & "|" & oAttributes.getValueFromName(strNamespaceURI, "id") & ":"
        bGetChars = True
    Case Else
        ' do nothing
End Select

End Sub

Private Sub IVBSAXContentHandler_startPrefixMapping(strPrefix As String, strURI As String)

End Sub
Toothbrush
  • 2,080
  • 24
  • 33
barrowc
  • 10,444
  • 1
  • 40
  • 53
  • `Implements IVBSAXContentHandler` generates a compile error: Object module needs to implement '~' for interface '~'. – Ryszard Jędraszyk May 10 '17 at 17:36
  • @RyszardJędraszyk I've added the full implementation of `ContentHandlerImpl` to the end of the answer. Any missing stub methods would cause an error about methods not being implemented – barrowc May 10 '17 at 23:02
  • @RyszardJędraszyk also [this answer](http://stackoverflow.com/a/29376815/2127508) which links to [this answer](http://stackoverflow.com/a/26604768/2127508) suggests that underscore characters in the name of the Sub being implemented may cause this issue – barrowc May 10 '17 at 23:21
  • Thank You, now it works brilliantly! Missing event subs were the problem. – Ryszard Jędraszyk May 11 '17 at 09:25
  • One more thing: oAttributes.getValueFromName(strNamespaceURI, "id") results in run-time error -2146828283. It's the same for any other method used on oAttributes. Any idea why it happens? – Ryszard Jędraszyk May 11 '17 at 10:35
  • @RyszardJędraszyk The most likely cause is that the element doesn't have an attribute called "id". You can check by iterating through `oAttributes`, like this: `Dim i As Integer : Dim found As Boolean : For i = 0 To (oAttributes.length - 1) : If oAttributes.getLocalName(i) = "id" Then : found = True : Exit For : End If : Next i : MsgBox found` – barrowc May 12 '17 at 00:21
  • 1
    It is true, I actually used a file which has no attributes at all and will need to get values as `strChars` within `Private Sub IVBSAXContentHandler_characters(strChars As String)`. I see that after all, it is really hard to create a flexible solution which will handle many different XML structures. – Ryszard Jędraszyk May 12 '17 at 13:17
0

Use the SelectSingleNode function. This will let you search for a node based on pattern matching.

For instance, I created the following function:

Private Function getXMLNodeValue(ByRef xmlDoc As MSXML2.DOMDocument, ByVal xmlPath As String)
    Dim node As IXMLDOMNode
    Set node = xmlDoc.SelectSingleNode(xmlPath)
    If node Is Nothing Then getXMLNodeValue = vbNullString Else getXMLNodeValue = node.Text
End Function

Now, if I have the following XML file: An XML Response

I can simply call:

myValue = getXMLNodeValue(xmlResult, "//ErrorStatus/Source")

and it will jump through to the first key called 'Error Status' at any depth, and pull out the text in the 'Source' node - returning "INTEGRATION"

Alain
  • 26,663
  • 20
  • 114
  • 184
  • I have to loop through every Row in the document and grab every node. This wouldn't add an improvement in speed for that would it? Isn't SelectSingleNode more for searching? – Doug S. Apr 11 '11 at 20:19
  • I'm sorry, I was under the impression you we navigating thoiugh many unimportant nodes just to get to the ones you cared about. – Alain Apr 12 '11 at 11:52