
I am new to working with comparatively large XML files, and I have run into the following problem:

I am using the lxml package to parse a 348.9 MB XML file and monitored the RAM usage with Activity Monitor on my Mac (macOS 10.13.5). Surprisingly, 6 GB of RAM are occupied after executing the code example below.

from lxml import etree

tree = etree.parse(path_to_file)
root = tree.getroot()

Can anyone explain to me why this happens and suggest an alternative method?

Patrick
  • It's not surprising at all. Parsing an *entire* file will create nodes in memory for every single element and attribute. These nodes will contain at least the element's name and contents. – Panagiotis Kanavos Jul 20 '18 at 07:47
  • The alternative is to *not* parse the entire file. Use a SAX parser, which reads each element and raises an event that your code has to handle, possibly using the listener pattern. [lxml offers SAX support](https://lxml.de/sax.html). It's more complicated, but this way you can handle very large files while reading only one element at a time. – Panagiotis Kanavos Jul 20 '18 at 07:49
  • Yeah, for a file that big you might want SAX, but honestly, 6 GB doesn't seem unmanageable. – pguardiario Jul 20 '18 at 09:39
  • Thank you all for your answers! Since I am working with large XML data for the first time, I was not aware of this increased RAM usage. After your combined input and searching for the keywords you gave me, the situation became much clearer. I managed to iteratively parse the file and store the data in a nested Python dictionary. This leads to a memory usage of approx. 870 MB for the 350 MB file. I am happy with that, especially because now I can process several of these files in parallel subprocesses. So, thank you again! – Patrick Jul 20 '18 at 13:10
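
A minimal sketch of the event-driven approach suggested in the comments, using lxml's target parser interface, which plays the same role as a SAX handler: the parser feeds start/end/data events to a handler object and never builds a full tree. The record tag name and the counting logic are assumptions for illustration only, since the actual structure of the file isn't shown.

from lxml import etree

class RecordCounter:
    # hypothetical handler: counts <record> elements without keeping a tree in memory
    def __init__(self):
        self.count = 0

    def start(self, tag, attrib):
        if tag == "record":
            self.count += 1

    def end(self, tag):
        pass

    def data(self, text):
        pass

    def close(self):
        return self.count

parser = etree.XMLParser(target=RecordCounter())
# with a target, parse() returns whatever close() returns instead of a tree
record_count = etree.parse(path_to_file, parser)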

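A sketch of the iterative parsing that the last comment describes, based on lxml's iterparse. The record tag name, the id attribute, and the flat child layout are assumptions about the file's structure; the clearing step is what keeps memory bounded instead of letting the parsed tree grow to the size of the whole document.

from lxml import etree

def iterparse_to_dict(path_to_file):
    data = {}
    # yields each <record> element as soon as its closing tag has been read
    for _, elem in etree.iterparse(path_to_file, events=("end",), tag="record"):
        # assumed layout: <record id="..."><child>text</child>...</record>
        data[elem.get("id")] = {child.tag: child.text for child in elem}
        # drop the finished record and any already-processed siblings
        elem.clear()
        while elem.getprevious() is not None:
            del elem.getparent()[0]
    return data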