Efficiently Parsing Large XML File with Python

Question

I have a 350MB XML file that I need to parse. The problem is that it's a collection of items. I'll post a full sample below, but it's something like:

<?xml blah blah>
<A>
 <B1>
  <details />
  <subdetails />
 <B2>
  <details />
  <subdetails /?
</A>

The issue is that I need loop through all of the B level details and retain everything in each B1 group.

Tried parsing with pandas built in parser lxml. It works slowly and OK for very small XML files but not so great with the 350MB XML file I really need to parse. I understand I may need etree in order to do this. However all examples for that don't let me retain enough information in the loop. Here's my sample XML file, heavily modified and simplified.

<?xml>
<files>
    <file_info>
        <signature>asdf1234lkjh0987</signature>
        <feed_timestamp>1547716688</feed_timestamp>
        <xml_timestamp>1547719291</xml_timestamp>
    </file_info>
    <file>
        <filename>windows.docx</filename>
        <file_id>10001</file_id>
        <cves>
            <cve>CVE-2018-0123</cve>
            <cve>CVE-2019-1357</cve>
        </cves>
        <bids>
            <bid>111</bid>
        </bids>
        <xrefs>
            <xref>ALPHA:ALPHA-ONE-SEVEN</xref>
        </xrefs>
        <preferences>
        </preferences>
        <attributes>
            <attribute>
                <name>cpe</name>
                <value>cpe:/o:microsoft:etc</value>
            </attribute>
            <attribute>
                <name>cvss_temporal_vector</name>
                <value>CVSS2#E:F/RL:OF/RC:ND</value>
            </attribute>
        </attributes>
    </file>
    <file>
        <filename>windows.xlsx</filename>
        <file_id>10002</file_id>
        <cves>
                <cve>CVE-2018-4567</cve>
            <cve>CVE-2019-9876</cve>
        </cves>
        <bids>
            <bid>222</bid>
        </bids>
        <xrefs>
            <xref>ALPHA:CHARLIE-THREE-CHARLIE</xref>
            <xref>OP:BILLOWY BADGER
        </xrefs>
        <preferences>
        </preferences>
        <attributes>
            <attribute>
                <name>cpe</name>
                <value>cpe:/o:microsoft:etc</value>
            </attribute>
                <attribute>
                <name>cvss_temporal_vector</name>
                <value>CVSS2#E:F/RL:OF/RC:ND</value>
            </attribute>
        </attributes>
    </file>
</files>

What I expect is to be able to use pandas to_excel function to output an Excel file that contains a few different tables. The file_id is a unique identifier / primary key for all of this data.

Example tables/sheets to export:

File_ID  |  CVE
10001    |  CVE-2018-0123
10001    |  CVE-2019-1357
10002    |  CVE-2018-4567
10002    |  CVE-2019-9876

File_ID  |  ALPHA
10001    |  ALPHA-ONE-SEVEN
10002    |  CHARLIE-THREE-CHARLIE

Attributes are unique - one Name and Value tag per Attribute entry. Multiple Attribute tags in each file. The following table would use File_ID as the unique / primary key and list everything that was a single-occurrence item.. Example data structure:

File_ID  |    Filename    |          CPE           |  CVSS_Temporal_Vector
10001    |  windows.docx  |  cpe:/o:microsoft:etc  |  CVSS2#E:F/RL:OF/RC:ND
10002    |  windows.xlsx  |  cpe:/o:microsoft:etc  |  CVSS2#E:F/RL:OF/RC:ND

Related: [Using Python Iterparse For Large XML Files](https://stackoverflow.com/q/7171140/190597) — unutbu, Jan 21 '19 at 23:57
I'd seen that solution but I'm not sure how to implement it. — Ford, Jan 22 '19 at 03:01
@Ford: Try this [Iterparse big XML, with low memory footprint, and get all, even nested, Sequence Elements](https://stackoverflow.com/a/53883799/7414759), it's a full working example. — stovfl, Jan 22 '19 at 07:48
@stovfl: Tried my best to implement your solution, but it returns nothing -- as in no printed statements. [I uploaded the .py and the small .xml sample file I ran it against.](https://1drv.ms/f/s!AkatFXL35BRFjWWBuPy51ryLJeoX) One of the other things I'm concerned about is that under Attributes tag I'd need to pull the text from the Name tag as the identifier then the following Value tag as the value. — Ford, Jan 22 '19 at 16:47
@stovfl: Already did. File tag is actually Nasl in my code but that makes no diff. `self.context = etree.iterparse(fh, events=("end",), tag=['Nasl'])` Appreciate your help. — Ford, Jan 22 '19 at 18:13
@Ford: Try with your above XML Data. Before, you have to fix **two Errors** `` shout be a **full** header and `OP:BILLOWY BADGER` have to be **closed** with `` — stovfl, Jan 23 '19 at 07:40

Efficiently Parsing Large XML File with Python

0 Answers0