0

I have huge XML files, approx 1GB each. They are so big that it can't be opened even in Notepad++ due to large volume of data.

I edited the XML and was able to parse it through DOMDocument60 (Thanks to the help provided at stackflow).

I was reading similar question at Improve speed of VBA but I am still not able to implement it properly, so need some guidance.

For example:

  1. How to load the xml in SAX ? Do i need to load it in DOMDocument60 before reading in SAX ?
  2. How to read line by line in SAX once imported ? In DOMDocument60, i can easily jump to any node and then loop through the childnodes but not sure how can i do this in SAX ?
  3. What time difference SAX can make to parsing huge XML, compared to DOMDocument? I haven't found any real time example on this.
  4. Is there any better option available than SAX in VBA, like any other library which I can use to speed up the process of Parsing.

Appreciate your suggestions. (Sample XML file is below)

<ParentNode type="actual">

<SampleObject class="POC" version="XYZ123" distName="Test1" id="Sample">
  <p name="name">POC1</p>
  <p name="object1">0</p>
  <p name="object2">6</p>
  <p name="object3">0</p>
</SampleObject>

<SampleObject class="POC" version="XYZ123" distName="Test2" id="Sample">
  <p name="name">POC1</p>
  <p name="object1">2</p>
  <p name="object2">10</p>
  <p name="object4">4</p>
  <p name="object3">6</p>
</SampleObject>

<SampleObject class="POC" version="XYZ123" distName="Test3" id="Sample">
  <p name="name">POC1</p>
  <p name="object2">90</p>
  <p name="object3">0</p>
</SampleObject>

<SampleObject class="POC" version="XYZ123" distName="Test4" id="Sample">
  <p name="name">POC1</p>
  <p name="object1">2</p>
  <p name="object2">10</p>
  <p name="object4">40</p>
  <p name="object3">61</p>
</SampleObject>

braX
  • 11,506
  • 5
  • 20
  • 33
  • The point of SAX is you don't load the whole document in one shot - you parse it "as you go". It's a little more complex than typical DOM-based parsing but there are examples out there (VB6 examples should translate OK to VBA) – Tim Williams May 11 '20 at 22:44
  • https://www.developerfusion.com/article/84405/sax-and-vb-6/ - this is for VB6 but seem to work in VBA. I've not used SAX before but I was up and running in 5 mins using your sample XML – Tim Williams May 11 '20 at 23:49
  • Many Thanks Tim. I looked at some examples online and one of the microsoft example of working with SAX is mentioned at [link](https://learn.microsoft.com/en-us/previous-versions/windows/desktop/ms762677%28v%3Dvs.85%29 ). This example has confused me that possibly i can use DOM load method with sax. I will check out the example which you mentioned to see if i can get it work. – Muhammad Adeel Ahmed May 12 '20 at 00:13
  • Whether SAX is any real advantage depends on what you need to do with the XML content: if you just need to extract the content into some other format or read only part of the content then SAX may work. – Tim Williams May 12 '20 at 00:39
  • Objective is to parse the whole XML file which is 1GB in size and have 1000s of records. DOM process works but it is very slow, takes hours to parse it. I think SAX will work if i read it line by line, just trying to use it for the first time. – Muhammad Adeel Ahmed May 12 '20 at 00:53

1 Answers1

0

Here's where I got following the link I posted above and using your sample XML. Just outputs to the Immediate window: I don't know what you're doing with the extracted data...

Test Method in regular module:

Sub Tester()

    Const FNAME As String = "example.xml"
    Dim rdr As New MSXML2.SAXXMLReader30
    Dim cnth As New ContentHandler

    Set rdr.ContentHandler = cnth
    rdr.parseURL ThisWorkbook.Path & "\" & FNAME  'test xml file is in same folder as the workbook

End Sub

Class module ContentHandler:

Option Explicit

Implements IVBSAXContentHandler

Dim cls, vers, distName, id, pName, pContent
Dim inSO As Boolean, inP As Boolean

Private Sub IVBSAXContentHandler_characters(strChars As String)
    If inP Then Debug.Print "P content:", strChars
End Sub

Private Sub IVBSAXContentHandler_startElement(strNamespaceURI As String, _
                             strLocalName As String, strQName As String, _
                             ByVal oAttributes As MSXML2.IVBSAXAttributes)
    Select Case strLocalName
        Case "SampleObject"
            inSO = True
            cls = oAttributes.getValueFromName("", "class")
            vers = oAttributes.getValueFromName("", "version")
            distName = oAttributes.getValueFromName("", "distName")
            id = oAttributes.getValueFromName("", "id")
            Debug.Print "Start", strLocalName, cls, vers, distName, id
        Case "p"
            inP = True
            pName = oAttributes.getValueFromName("", "name")
            Debug.Print "Start", strLocalName, pName
    End Select
End Sub

Private Sub IVBSAXContentHandler_endElement(strNamespaceURI As String, strLocalName As String, strQName As String)
    Select Case strLocalName
        Case "SampleObject"
            inSO = False
            cls = ""
            vers = ""
            distName = ""
            id = ""
        Case "p"
            pName = ""
            inP = False
    End Select
End Sub

Private Property Set IVBSAXContentHandler_documentLocator( _
                            ByVal RHS As MSXML2.IVBSAXLocator)
End Property

Private Sub IVBSAXContentHandler_startDocument()
End Sub

Private Sub IVBSAXContentHandler_endDocument()
End Sub

Private Sub IVBSAXContentHandler_endPrefixMapping(strPrefix As String)
End Sub

Private Sub IVBSAXContentHandler_ignorableWhitespace(strChars As String)
End Sub

Private Sub IVBSAXContentHandler_processingInstruction(strTarget As String, strData As String)
End Sub

Private Sub IVBSAXContentHandler_skippedEntity(strName As String)
End Sub

Private Sub IVBSAXContentHandler_startPrefixMapping(strPrefix As String, strURI As String)
End Sub
Tim Williams
  • 154,628
  • 8
  • 97
  • 125
  • Many Thanks Tim, I think you defined the code for excel but I will be doing it for access. It will hopefully work in the same way. Once, I am able to parse the xml then I want to store it in Access tables. Attribute, class="POC" will be the table name so I will be creating it at the start. As i said earlier, I am trying to figure it out what time difference it can make if i use SAX compared to DOM. I am sure it will be much quicker. I have also done parsing using linq query in vb.net which seems to work well but then i will be moving away from Office platform (vba). – Muhammad Adeel Ahmed May 12 '20 at 15:40
  • I will give my feedback if i manage to work it. If not then i will report any errors , many thanks – Muhammad Adeel Ahmed May 12 '20 at 15:41
  • Thanks mate, it seems to be working just fine with access, i have to now complete the code for different types of objects and then store the values in access tables. It seems to be reading line by line perfectly – Muhammad Adeel Ahmed May 13 '20 at 00:35
  • Hi mate, there are few problems with this code - first of all i tried to use SAXXMLReader60 instead of SAXXMLReader30 and it gives error at the rdr.parseURL. If i try to use SAXXMLReader30 as you have mentioned then it works while reading the line but if i try to call any Sub in the select statement (IVBSAXContentHandler_startElement) then it generates an error. The error code is again shown with the parseURL. – Muhammad Adeel Ahmed May 13 '20 at 02:19