
I have around 1000 XML files, each about 250 MB in size. I need to extract some data from them and write it to a CSV file. There cannot be any duplicate entries.

I have a system with 4GB RAM and an AMD A8 processor.

I have already gone through some previous posts here but they don't seem to answer my problem.

I have already written the code in Python and tested it on a sample XML and it worked well.

However, it was very slow (almost 15 minutes per file) when I ran it on all the files, and I had to terminate the process midway.

What can be an optimal solution to speed up the process?

Here's the code:

import glob
import xml.etree.ElementTree as ET

path='data/*.xml'
t=[]
for fname in glob.glob(path):
    print('Parsing ',fname)
    tree=ET.parse(fname)
    root=tree.getroot()
    x=root.findall('.//Article/AuthorList//Author')
    for child in x:
        try:
            lastName=child.find('LastName').text
        except AttributeError:
            lastName=''
        try:
            foreName=child.find('ForeName').text
        except AttributeError:
            foreName=''
        t.append((lastName,foreName))
    print('Parsed ',fname)

t=set(t)

I want the fastest method to get the entries without any duplicates. (Maybe I should store the entries in some DB instead of the variable t. Would writing each entry to a DB speed things up because of the extra free RAM? Whatever the method, I need some direction towards it.)

1 Answer


Instead of collecting the results in a Python list, create a database table with a UNIQUE constraint and write all the results to that table. Once all the writing has been done, dump the DB table to a CSV file.

If you don't want any additional dependencies for writing to the DB, I suggest you use sqlite3, as it comes out of the box with any recent Python installation.

Here's some code to get started:

import sqlite3
conn = sqlite3.connect('large_xml.db')  # db will be created
cur = conn.cursor()
crt = "CREATE TABLE foo(fname VARCHAR(20), lname VARCHAR(20), UNIQUE(fname, lname))"
cur.execute(crt)
conn.commit()
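
# Optional extra (not part of the original suggestion): for a one-off bulk load you can
# relax SQLite's durability settings to speed up the inserts; if the script crashes,
# just delete the .db file and re-run it.
cur.execute("PRAGMA synchronous = OFF")      # don't fsync after every commit
cur.execute("PRAGMA journal_mode = MEMORY")  # keep the rollback journal in RAM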

path='data/*.xml'
for fname in glob.glob(path):
    print('Parsing ',fname)
    tree=ET.parse(fname)
    root=tree.getroot()
    x=root.findall('.//Article/AuthorList//Author')
    count = 0
    for child in x:
        try:
            lastName=child.find('LastName').text
        except AttributeError:
            lastName=''
        try:
            foreName=child.find('ForeName').text
        except AttributeError:
            foreName=''
        cur.execute("INSERT OR IGNORE INTO foo(fname, lname) VALUES(?, ?)", (foreName, lastName))
        count += 1
        if count > 3000:  # commit every 3000 entries, you can tune this
            count = 0
            conn.commit()

    conn.commit()  # commit whatever is left over for this file
    print('Parsed ',fname)

After the database is populated, dump it to CSV from the command line as follows:

sqlite3 -header -csv large_xml.db "select * from foo;" > dump.csv
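
If you'd rather stay in Python for the export, a minimal sketch using the standard csv and sqlite3 modules (assuming the same database file and table names as above) would be:

import csv
import sqlite3

conn = sqlite3.connect('large_xml.db')
cur = conn.cursor()
with open('dump.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['fname', 'lname'])                            # header row
    writer.writerows(cur.execute("SELECT fname, lname FROM foo"))  # stream rows straight from the cursor
conn.close()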

Also, experiment with faster ways of parsing. Furthermore, if the .text attribute is available most of the time, the following will probably be faster than exception handling:

lastName = getattr(child.find('LastName'), 'text', '')
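
On the parsing side, ET.parse() loads each 250 MB file into memory as a full tree, which is painful on a 4 GB machine. One "faster parsing way" worth experimenting with is xml.etree.ElementTree.iterparse, which processes elements as they are read and lets you discard them immediately. Here is a sketch, assuming the same Author/LastName/ForeName structure with no XML namespaces and reusing the cur and conn objects from the snippet above:

import glob
import xml.etree.ElementTree as ET

for fname in glob.glob('data/*.xml'):
    # 'end' events fire once an element has been fully read, so the whole
    # document never has to sit in memory at the same time
    for event, elem in ET.iterparse(fname, events=('end',)):
        if elem.tag == 'Author':
            lastName = getattr(elem.find('LastName'), 'text', '') or ''
            foreName = getattr(elem.find('ForeName'), 'text', '') or ''
            cur.execute("INSERT OR IGNORE INTO foo(fname, lname) VALUES(?, ?)",
                        (foreName, lastName))
            elem.clear()  # drop the element's children to keep memory low
    conn.commit()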