Parsing a very large XML file (18.5 GB) in Python using iterparse in lxml with limited RAM. Is there a way?

Question

I am trying to parse a large XML data dump (18.5 GB) with limited RAM (~6 GB). I only want to grab a few tags from each object and build a hash table from those tags. I am currently using iterparse (because the whole file cannot be loaded into memory) and xpath (to find the tags I want).

Is this possible?

Here is a sample:

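from io import BytesIO
from lxml import etree

# xml holds the document as bytes (a small sample here, not the full dump)
context = etree.iterparse(BytesIO(xml))

artistReleases = {}

for action, elem in context:
    # Absolute XPath queries, re-run against the whole tree on every 'end' event
    artistName = elem.xpath('/releases/release/artists/artist/name')
    releaseName = elem.xpath('/releases/release/title')

# Pair up the results from the last event and group releases by artist
i = 0
while i < len(artistName):
    artist = artistName[i].text
    release = releaseName[i].text
    if artist in artistReleases:
        artistReleases[artist].append(release)
    else:
        # Start a list so that append() works for later releases
        artistReleases[artist] = [release]

    i += 1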

On an 8 MB file, this takes about 20 minutes. I am hoping to process the 18.5 GB file in under a month. :)

Are you asking if what you're currently using is possible? Or whether it *would* be possible? — Jon Clements, May 24 '13 at 08:05
Sure it is possible. But clearly you have some doubts borne from some experience with failures. How about you show us what you tried and what went wrong with that? — Martijn Pieters, May 24 '13 at 08:07
I was looking at SAX in Java and got a simple sketch working but found the Python approach way quicker to set up. Just wondering if you could give direction in terms of a framework. If the SAX approach would be better, I would pursue that, or if you have a suggestion of something faster, considering limited RAM, it would be much appreciated. — lewis_r_s, May 24 '13 at 08:18
Have you looked at [related questions](http://stackoverflow.com/q/7171140/1258041)? — Lev Levitsky, May 24 '13 at 08:30
As for the question linked by @LevLevitsky, make sure to read the IBM article (answer by *unutbu*). — root, May 24 '13 at 08:35
The IBM article looks like it will answer my question. Great resource. I am new to datasets of this size. Thanks for your time. — lewis_r_s, May 24 '13 at 08:56
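
For readers who land here, below is a minimal sketch of the kind of pattern the comments point toward: handle one release at a time and free it before moving on. It is not taken from the linked answer verbatim. The function name `collect_artist_releases` and the sample data are made up for illustration, and the element layout (`releases/release/title`, `releases/release/artists/artist/name`) is assumed from the XPath expressions in the question.

```python
from collections import defaultdict
from io import BytesIO

from lxml import etree


def collect_artist_releases(source):
    """Map each artist name to the titles of the releases it appears on.

    `source` is a filename or a binary file-like object (hypothetical helper).
    """
    artist_releases = defaultdict(list)
    # Fire only when a <release> element has been completely parsed.
    for _, release in etree.iterparse(source, events=("end",), tag="release"):
        title = release.findtext("title")
        # Relative paths search inside this release only, not the whole tree.
        for name in release.findall("artists/artist/name"):
            artist_releases[name.text].append(title)
        # Free this element's children, then delete already-processed
        # siblings that the root element would otherwise keep alive.
        release.clear()
        while release.getprevious() is not None:
            del release.getparent()[0]
    return dict(artist_releases)


# Tiny stand-in for the real dump (assumed structure, invented values).
sample = b"""<releases>
  <release><title>First</title>
    <artists><artist><name>A</name></artist></artists></release>
  <release><title>Second</title>
    <artists><artist><name>A</name></artist><artist><name>B</name></artist></artists></release>
</releases>"""

result = collect_artist_releases(BytesIO(sample))
print(result)
```

For the real file, the filename can be passed instead of the `BytesIO` object. Because each release is cleared once it has been read, memory use should stay roughly flat apart from the dictionary itself, which still has to fit in RAM.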
