I am trying to parse a 17G XML file. I need to read all attributes of each tag () and save them in a CSV. I am using the following code to parse XML. It is taking too much time to parse. For example, 10000 rows take more than 5 minutes to be exported. I have a Windows machine (Intel i5 2.4GHz/8G RAM).
Sample XML
<?xml version="1.0" encoding="utf-8"?>
<comments>
<row Id="2" PostId="35314" Score="8" Text="Yeah, I didn't believe it until I created a console app - but good lord! Why would they give you the rope to hang yourself! I hated that about VB.NET - the OrElse and AndAlso keywords!" CreationDate="2008-09-06T08:09:52.330" UserId="3" />
</comments>
Python Code
import xml.etree.ElementTree as etree
import pandas as pd
df = pd.DataFrame(columns=('Id','PostId','Score','Text','CreationDate','UserId'))
def process_params(elem):
global df
row = [elem.attrib.get('Id'),elem.attrib.get('PostId'),elem.attrib.get('Score'),elem.attrib.get('Text'),elem.attrib.get('CreationDate'),elem.attrib.get('UserId')]
df.loc[len(df)] = row
if (len(df) % 10000 == 0):
with open('xaa.csv', 'a', encoding="utf-8") as f:
df.to_csv(f, header=False)
print("{} rows exported".format(len(df)))
for event, elem in etree.iterparse("xaa.xml", events=('start', 'end')):
if event == 'start':
if elem.tag == 'row':
process_params(elem)
Tried another variant of code without Pandas which fails due to memory outage
def process_params(elem):
row = [elem.attrib.get('Id'),elem.attrib.get('PostId'),elem.attrib.get('Score'),elem.attrib.get('Text'),elem.attrib.get('CreationDate'),elem.attrib.get('UserId')]
with open("comments1.csv", "a", encoding="utf-8") as f:
wr = csv.writer(f)
wr.writerow(row)
for event, elem in etree.iterparse("comments.xml", events=('start', 'end')):
if event == 'start':
if elem.tag == 'row':
process_params(elem)
Tried a few more variants, but nothing worked or takes forever. Please suggest.