I have a 350MB XML file that I need to parse. The problem is that it's a collection of items. I'll post a full sample below, but it's something like:
<?xml blah blah>
<A>
<B1>
<details />
<subdetails />
</B1>
<B2>
<details />
<subdetails />
</B2>
</A>
The issue is that I need to loop through all of the B-level elements and retain everything inside each B group.
I tried parsing it with pandas' built-in parser, lxml. It is slow but works for very small XML files; it does not cope well with the 350MB XML file I actually need to parse. I understand I may need etree for this. However, none of the examples I have found let me retain enough information inside the loop. Here's my sample XML file, heavily modified and simplified.
<?xml>
<files>
<file_info>
<signature>asdf1234lkjh0987</signature>
<feed_timestamp>1547716688</feed_timestamp>
<xml_timestamp>1547719291</xml_timestamp>
</file_info>
<file>
<filename>windows.docx</filename>
<file_id>10001</file_id>
<cves>
<cve>CVE-2018-0123</cve>
<cve>CVE-2019-1357</cve>
</cves>
<bids>
<bid>111</bid>
</bids>
<xrefs>
<xref>ALPHA:ALPHA-ONE-SEVEN</xref>
</xrefs>
<preferences>
</preferences>
<attributes>
<attribute>
<name>cpe</name>
<value>cpe:/o:microsoft:etc</value>
</attribute>
<attribute>
<name>cvss_temporal_vector</name>
<value>CVSS2#E:F/RL:OF/RC:ND</value>
</attribute>
</attributes>
</file>
<file>
<filename>windows.xlsx</filename>
<file_id>10002</file_id>
<cves>
<cve>CVE-2018-4567</cve>
<cve>CVE-2019-9876</cve>
</cves>
<bids>
<bid>222</bid>
</bids>
<xrefs>
<xref>ALPHA:CHARLIE-THREE-CHARLIE</xref>
<xref>OP:BILLOWY BADGER</xref>
</xrefs>
<preferences>
</preferences>
<attributes>
<attribute>
<name>cpe</name>
<value>cpe:/o:microsoft:etc</value>
</attribute>
<attribute>
<name>cvss_temporal_vector</name>
<value>CVSS2#E:F/RL:OF/RC:ND</value>
</attribute>
</attributes>
</file>
</files>
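To make the "loop over each file and keep everything in it" requirement concrete, here is a minimal sketch of streaming parsing with the standard library's xml.etree.ElementTree.iterparse. It assumes the real file uses the same tag names as the sample above; the iter_files helper and the shortened SAMPLE document are made up for illustration. The idea is to wait for the "end" event of each <file>, at which point its whole subtree is available, and then clear the root so memory does not grow with the file size.

```python
import io
import xml.etree.ElementTree as ET

# Tiny stand-in for the real 350MB file; tag names follow the sample above.
SAMPLE = b"""<files>
  <file_info><signature>asdf1234lkjh0987</signature></file_info>
  <file>
    <filename>windows.docx</filename>
    <file_id>10001</file_id>
    <cves><cve>CVE-2018-0123</cve><cve>CVE-2019-1357</cve></cves>
  </file>
  <file>
    <filename>windows.xlsx</filename>
    <file_id>10002</file_id>
    <cves><cve>CVE-2018-4567</cve><cve>CVE-2019-9876</cve></cves>
  </file>
</files>"""


def iter_files(source):
    """Yield each complete <file> element (hypothetical helper)."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # first event is the start of the root <files>
    for event, elem in context:
        if event == "end" and elem.tag == "file":
            yield elem    # the whole <file> subtree is populated here
            root.clear()  # detach processed children so they can be freed


# Build the long File_ID | CVE table as a list of row dicts.
cve_rows = []
for file_elem in iter_files(io.BytesIO(SAMPLE)):
    file_id = file_elem.findtext("file_id")
    for cve in file_elem.iterfind("cves/cve"):
        cve_rows.append({"File_ID": file_id, "CVE": cve.text})
```

For the real data, a file path could be passed to iter_files instead of the BytesIO object, since iterparse accepts either a filename or a file object.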
What I want is to use pandas' to_excel function to output an Excel file that contains a few different tables. The file_id is the unique identifier / primary key for all of this data.
Example tables/sheets to export:
File_ID | CVE
10001 | CVE-2018-0123
10001 | CVE-2019-1357
10002 | CVE-2018-4567
10002 | CVE-2019-9876
File_ID | ALPHA
10001 | ALPHA-ONE-SEVEN
10002 | CHARLIE-THREE-CHARLIE
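The ALPHA table above suggests that each xref is split on its prefix. The following is a rough sketch of that step, assuming the text before the first colon names the table and the rest is the value; that rule is inferred from the example, and group_xrefs is a made-up helper name.

```python
from collections import defaultdict

# (file_id, xref text) pairs as they would come out of each <file> element.
xrefs = [
    ("10001", "ALPHA:ALPHA-ONE-SEVEN"),
    ("10002", "ALPHA:CHARLIE-THREE-CHARLIE"),
    ("10002", "OP:BILLOWY BADGER"),
]


def group_xrefs(pairs):
    """Group 'PREFIX:value' strings into one list of rows per prefix."""
    tables = defaultdict(list)
    for file_id, text in pairs:
        prefix, sep, value = text.partition(":")
        if not sep:  # assumption: entries without a prefix are skipped
            continue
        tables[prefix].append({"File_ID": file_id, prefix: value})
    return dict(tables)


xref_tables = group_xrefs(xrefs)
```

With this grouping, the OP entry ends up in its own table rather than in the ALPHA one.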
Attributes are unique: each Attribute entry has one Name tag and one Value tag, and each file has multiple Attribute tags. The following table would use File_ID as the unique / primary key and list everything that is a single-occurrence item. Example data structure:
File_ID | Filename | CPE | CVSS_Temporal_Vector
10001 | windows.docx | cpe:/o:microsoft:etc | CVSS2#E:F/RL:OF/RC:ND
10002 | windows.xlsx | cpe:/o:microsoft:etc | CVSS2#E:F/RL:OF/RC:ND
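To illustrate the wide table, here is a minimal sketch that turns one <file> element into a single row by using each attribute's name as the column and its value as the cell. The helper names file_to_row and write_sheets are hypothetical, and the column names are left as they appear in the XML (cpe rather than CPE). The Excel step assumes pandas plus an Excel engine such as openpyxl is installed, so it is kept in a separate function that is not called here.

```python
import xml.etree.ElementTree as ET

# One <file> element from the sample, trimmed to the single-occurrence items.
FILE_XML = """<file>
  <filename>windows.docx</filename>
  <file_id>10001</file_id>
  <attributes>
    <attribute><name>cpe</name><value>cpe:/o:microsoft:etc</value></attribute>
    <attribute><name>cvss_temporal_vector</name><value>CVSS2#E:F/RL:OF/RC:ND</value></attribute>
  </attributes>
</file>"""


def file_to_row(file_elem):
    """Flatten one <file> element into a dict keyed by column name."""
    row = {
        "File_ID": file_elem.findtext("file_id"),
        "Filename": file_elem.findtext("filename"),
    }
    # Each <attribute> contributes one column: its <name> is the header.
    for attr in file_elem.iterfind("attributes/attribute"):
        row[attr.findtext("name")] = attr.findtext("value")
    return row


def write_sheets(sheets, path):
    """Write {sheet name: list of row dicts} into one workbook."""
    import pandas as pd  # imported lazily; only needed for the Excel step

    with pd.ExcelWriter(path) as writer:
        for name, rows in sheets.items():
            pd.DataFrame(rows).to_excel(writer, sheet_name=name, index=False)


row = file_to_row(ET.fromstring(FILE_XML))
```

Collecting one such row per file and passing something like {"Files": rows, "CVE": cve_rows} to write_sheets would give one sheet per table, all keyed on File_ID.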