I am trying to parse a large XML data dump (18.5 GB) with limited RAM (~6 GB). I only need to grab a few tags from each record and build a hashtable (a dict) keyed on those tags. I am currently using iterparse (because the whole file can't be loaded into memory) and XPath (to find the tags I want).
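The end result I'm after is just a dict mapping each artist name to a list of that artist's release titles, roughly this shape (the names here are made up):

artistReleases = {
    'Artist A': ['Release 1', 'Release 2'],
    'Artist B': ['Release 3'],
}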
Is this possible?
Here is a sample of what I have so far:
from io import StringIO
from lxml import etree

# xml holds the sample document as a string; the full dump obviously won't fit this way
context = etree.iterparse(StringIO(xml))
artistReleases = {}

for action, elem in context:
    # absolute paths are evaluated from the document root on every event
    artistName = elem.xpath('/releases/release/artists/artist/name')
    releaseName = elem.xpath('/releases/release/title')
    i = 0
    while i < len(artistName):
        artist = artistName[i].text
        release = releaseName[i].text
        if artist in artistReleases:
            artistReleases[artist].append(release)
        else:
            artistReleases[artist] = [release]  # start a list so append works later
        i += 1
On an 8 MB sample file this takes ~20 min. At that rate, even purely linear scaling puts 18.5 GB at roughly 18,500 / 8 × 20 min ≈ 46,000 minutes, i.e. just over a month. I am hoping to do the full 18.5 GB in under a month. :)
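Would switching to something along these lines be the right direction? This is only an untested sketch of the tag-filtered, element-clearing iterparse pattern — 'releases.xml' is a placeholder for the real dump path, and the tag names are the ones from my XPath above:

from lxml import etree

artistReleases = {}

# Stream straight from the file and only react to completed <release> elements
for action, release in etree.iterparse('releases.xml', events=('end',), tag='release'):
    title = release.findtext('title')
    for name in release.findall('artists/artist/name'):
        artistReleases.setdefault(name.text, []).append(title)
    # Discard what has already been processed so memory stays bounded
    release.clear()
    while release.getprevious() is not None:
        del release.getparent()[0]

The idea being that filtering on tag='release' avoids running an XPath from the document root on every event, and clear() plus deleting earlier siblings keeps the partially built tree small.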