
I am trying to parse a 17 GB XML file. I need to read all attributes of each `<row>` tag and save them to a CSV file. I am using the following code, but it is too slow: exporting 10,000 rows takes more than 5 minutes. I am on a Windows machine (Intel i5 2.4 GHz, 8 GB RAM).

Sample XML

<?xml version="1.0" encoding="utf-8"?>
<comments>
  <row Id="2" PostId="35314" Score="8" Text="Yeah, I didn't believe it until I created a console app - but good lord!  Why would they give you the rope to hang yourself!  I hated that about VB.NET - the OrElse and AndAlso keywords!" CreationDate="2008-09-06T08:09:52.330" UserId="3" /> 
</comments>

Python Code

import xml.etree.ElementTree as etree
import pandas as pd

df = pd.DataFrame(columns=('Id','PostId','Score','Text','CreationDate','UserId'))

def process_params(elem):
    global df
    row = [elem.attrib.get('Id'), elem.attrib.get('PostId'),
           elem.attrib.get('Score'), elem.attrib.get('Text'),
           elem.attrib.get('CreationDate'), elem.attrib.get('UserId')]
    df.loc[len(df)] = row

    if (len(df) % 10000 == 0):
        with open('xaa.csv', 'a', encoding="utf-8") as f:
            df.to_csv(f, header=False)
        print("{} rows exported".format(len(df)))

for event, elem in etree.iterparse("xaa.xml", events=('start', 'end')):
    if event == 'start':
        if elem.tag == 'row':
            process_params(elem)

I also tried another variant without pandas, but it fails with an out-of-memory error:

import csv

def process_params(elem):
    row = [elem.attrib.get('Id'), elem.attrib.get('PostId'),
           elem.attrib.get('Score'), elem.attrib.get('Text'),
           elem.attrib.get('CreationDate'), elem.attrib.get('UserId')]
    with open("comments1.csv", "a", encoding="utf-8") as f:
        wr = csv.writer(f)
        wr.writerow(row)

for event, elem in etree.iterparse("comments.xml", events=('start', 'end')):
    if event == 'start':
        if elem.tag == 'row':
            process_params(elem)

I tried a few more variants, but they either failed or took forever. Please suggest a faster approach.

Ali
  • `etree.iterparse()` still builds the entire tree (compare: https://stackoverflow.com/q/12160418/18771). You don't want that. You can try using the option presented in the answer to the linked thread, or you can try to parse your input with a SAX parser (e.g. https://lxml.de/sax.html). – Tomalak Sep 11 '18 at 04:11
  • I'd also ditch pandas and use a plain regular CSV writer and a file in append mode. One "read" event -> one row written, just like in your second code sample, except that I would not open and close the file for every row, but keep it open the whole time. – Tomalak Sep 11 '18 at 04:19
  • There are a variety of technologies available for processing XML in streamed mode. By avoiding constructing a tree in memory, these will greatly reduce the memory requirement for parsing a large input file. However, parsing itself is a sequential process, and 17Gb in 5 minutes isn't actually a bad outcome. If you require significantly better than this, then you may have to rethink the design of your application. – Michael Kay Sep 11 '18 at 07:36
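Putting the comments together, here is a minimal sketch of what they suggest: iterate on `end` events only, clear each element after reading it so the tree does not grow with the file, and keep a single `csv.writer` open for the whole run instead of reopening the output file per row. The function name `xml_to_csv` and the output path are my own; the column list comes from the question.

```python
import csv
import xml.etree.ElementTree as etree

FIELDS = ['Id', 'PostId', 'Score', 'Text', 'CreationDate', 'UserId']

def xml_to_csv(xml_path, csv_path):
    # Open the output file once and keep it open for the whole run,
    # rather than reopening it for every row.
    with open(csv_path, 'w', encoding='utf-8', newline='') as f:
        writer = csv.writer(f)
        writer.writerow(FIELDS)
        # Only 'end' events are needed: by the time an element's end
        # tag is seen, all of its attributes are available.
        for event, elem in etree.iterparse(xml_path, events=('end',)):
            if elem.tag == 'row':
                writer.writerow([elem.attrib.get(k) for k in FIELDS])
            # Clear the element so already-processed nodes don't keep
            # accumulating in memory as the 17 GB file streams through.
            elem.clear()
```

This avoids both problems at once: there is no pandas `DataFrame` growing row by row (the `df.loc[len(df)] = row` pattern gets slower as the frame grows), and the parsed tree is discarded as it is read, so memory use stays roughly constant regardless of input size.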

0 Answers