
I am new to working with comparatively large XML files, and I have run into the following problem:

I am using the lxml package to parse a 348.9 MB XML file and monitored the RAM usage with Activity Monitor on my Mac (macOS 10.13.5). Surprisingly, 6 GB of RAM are occupied after executing the code example below.

from lxml import etree

tree = etree.parse(path_to_file)
root = tree.getroot()

Can anyone explain to me why this happens and suggest an alternative method?

Patrick
  • It's not surprising at all. Parsing an *entire* file will create nodes in memory for every single element and attribute. These nodes will contain at least the element's name and contents. – Panagiotis Kanavos Jul 20 '18 at 07:47
  • The alternative is to *not* parse the entire file. Use a SAX parser, which reads each element and raises an event that your code has to handle, possibly using the listener pattern. [lxml offers SAX support](https://lxml.de/sax.html). It's more complicated, but this way you can handle very large files while reading only one element at a time. – Panagiotis Kanavos Jul 20 '18 at 07:49
  • Yeah, for a file that big you might want SAX, but honestly, 6 GB doesn't seem unmanageable. – pguardiario Jul 20 '18 at 09:39
  • Thank you all for your answers! Since I am working with large XML data for the first time, I was not aware of this increased RAM usage. After your combined input and searching for the keywords you gave me, the situation became much clearer. I managed to iteratively parse the file and store the data in a nested Python dictionary. This leads to a memory usage of approx. 870 MB for the 350 MB file. I am happy with that, especially because now I can process several of these files in parallel subprocesses. So, thank you again! – Patrick Jul 20 '18 at 13:10
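
A minimal sketch of the event-driven approach suggested in the comments, using lxml's target parser interface, which plays the same role as a SAX handler: the parser feeds start/end/data events to a handler object and never builds a full tree. The record tag name and the counting logic are assumptions for illustration only, since the actual structure of the file isn't shown.

from lxml import etree

class RecordCounter:
    # hypothetical handler: counts <record> elements without keeping a tree in memory
    def __init__(self):
        self.count = 0

    def start(self, tag, attrib):
        if tag == "record":
            self.count += 1

    def end(self, tag):
        pass

    def data(self, text):
        pass

    def close(self):
        return self.count

parser = etree.XMLParser(target=RecordCounter())
# with a target, parse() returns whatever close() returns instead of a tree
record_count = etree.parse(path_to_file, parser)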

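A sketch of the iterative parsing that the last comment describes, based on lxml's iterparse. The record tag name, the id attribute, and the flat child layout are assumptions about the file's structure; the clearing step is what keeps memory bounded instead of letting the parsed tree grow to the size of the whole document.

from lxml import etree

def iterparse_to_dict(path_to_file):
    data = {}
    # yields each <record> element as soon as its closing tag has been read
    for _, elem in etree.iterparse(path_to_file, events=("end",), tag="record"):
        # assumed layout: <record id="..."><child>text</child>...</record>
        data[elem.get("id")] = {child.tag: child.text for child in elem}
        # drop the finished record and any already-processed siblings
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return data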