
I have around 1000 XML files, each about 250 MB in size. I need to extract some data from them and write it to a CSV file. There cannot be any duplicate entries.

I have a system with 4GB RAM and an AMD A8 processor.

I have already gone through some previous posts here but they don't seem to answer my problem.

I have already written the code in Python and tested it on a sample XML and it worked well.

However, it was very slow (almost 15 minutes per file) when I ran it on all the files, and I had to terminate the process midway.

What can be an optimal solution to speed up the process?

Here's the code:

import glob
import xml.etree.ElementTree as ET

path='data/*.xml'
t=[]
for fname in glob.glob(path):
    print('Parsing ',fname)
    tree=ET.parse(fname)
    root=tree.getroot()
    x=root.findall('.//Article/AuthorList//Author')
    for child in x:
        try:
            lastName=child.find('LastName').text
        except AttributeError:
            lastName=''
        try:
            foreName=child.find('ForeName').text
        except AttributeError:
            foreName=''
        t.append((lastName,foreName))
    print('Parsed ',fname)

t=set(t)

I want the fastest method to get the entries without any duplicates. (Maybe I should store the entries in some DB instead of the variable t. Would writing each entry to a DB speed things up because of the extra free RAM? Whatever the method, I need some direction towards it.)

1 Answer


Instead of collecting the results in a Python list, create a database table with a UNIQUE constraint and write all the results to that table. Once all the writing has been done, dump the DB table to a CSV file.

If you don't want any additional dependencies for writing to the DB, I suggest you use sqlite3, as it comes out of the box with any recent Python installation.

Here's some code to get started:

import sqlite3
conn = sqlite3.connect('large_xml.db')  # db will be created
cur = conn.cursor()
crt = "CREATE TABLE foo(fname VARCHAR(20), lname VARCHAR(20), UNIQUE(fname, lname))"
cur.execute(crt)
conn.commit()
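
# Optional extra (not part of the original suggestion): for a one-off bulk load you can
# relax SQLite's durability settings to speed up the inserts; if the script crashes,
# just delete the .db file and re-run it.
cur.execute("PRAGMA synchronous = OFF")      # don't fsync after every commit
cur.execute("PRAGMA journal_mode = MEMORY")  # keep the rollback journal in RAM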

path='data/*.xml'
for fname in glob.glob(path):
    print('Parsing ',fname)
    tree=ET.parse(fname)
    root=tree.getroot()
    x=root.findall('.//Article/AuthorList//Author')
    count = 0
    for child in x:
        try:
            lastName=child.find('LastName').text
        except AttributeError:
            lastName=''
        try:
            foreName=child.find('ForeName').text
        except AttributeError:
            foreName=''
        cur.execute("INSERT OR IGNORE INTO foo(fname, lname) VALUES(?, ?)", (foreName, lastName))
        count += 1
        if count > 3000:  # commit every 3000 entries, you can tune this
            count = 0
            conn.commit()

    conn.commit()  # commit whatever is left over for this file
    print('Parsed ',fname)

After the database is populated, dump it to CSV from the command line as follows:

sqlite3 -header -csv large_xml.db "select * from foo;" > dump.csv
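
If you'd rather stay in Python for the export, a minimal sketch using the standard csv and sqlite3 modules (assuming the same database file and table names as above) would be:

import csv
import sqlite3

conn = sqlite3.connect('large_xml.db')
cur = conn.cursor()
with open('dump.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['fname', 'lname'])                            # header row
    writer.writerows(cur.execute("SELECT fname, lname FROM foo"))  # stream rows straight from the cursor
conn.close()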

Also, experiment with faster ways of parsing. Furthermore, if the .text attribute is available most of the time, the following will probably be faster than exception handling:

lastName = getattr(child.find('LastName'), 'text', '')
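
On the parsing side, ET.parse() loads each 250 MB file into memory as a full tree, which is painful on a 4 GB machine. One "faster parsing way" worth experimenting with is xml.etree.ElementTree.iterparse, which processes elements as they are read and lets you discard them immediately. Here is a sketch, assuming the same Author/LastName/ForeName structure with no XML namespaces and reusing the cur and conn objects from the snippet above:

import glob
import xml.etree.ElementTree as ET

for fname in glob.glob('data/*.xml'):
    # 'end' events fire once an element has been fully read, so the whole
    # document never has to sit in memory at the same time
    for event, elem in ET.iterparse(fname, events=('end',)):
        if elem.tag == 'Author':
            lastName = getattr(elem.find('LastName'), 'text', '') or ''
            foreName = getattr(elem.find('ForeName'), 'text', '') or ''
            cur.execute("INSERT OR IGNORE INTO foo(fname, lname) VALUES(?, ?)",
                        (foreName, lastName))
            elem.clear()  # drop the element's children to keep memory low
    conn.commit()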