Optimize DOM parsing to use less memory

Question

self.dom = dom = minidom.parse(datasource)

datasource is a 30MB XML file. This parse statement constructs a tree structure, and it consumes almost 2.5GB of RAM, which is too much for me.

However, my company currently uses only Python 2.4, so I can't use ElementTree or any of the newer parsing methods. Switching to SAX parsing would also be very costly for me right now. So, are there any optimizations I can make to DOM parsing to reduce the amount of memory used?

Also, I would like to know whether it is the parsing of the XML file that takes 2.5GB, or the tree structure (dom/self.dom) generated by the parse. How do I find that out?

Parsing larger XML files is typically solved by iterating over the content (using SAX or an iterator, e.g. from `lxml`), which allows a minimal memory footprint for almost any size of source XML document. But being constrained to Python 2.4 (`lxml` is not available from PyPI for that version, and it could be a challenge to find and build an older version) and not being willing to use SAX excludes all the techniques that have proved efficient for this kind of task. – Jan Vlcinsky, Jan 09 '15 at 11:15
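
To illustrate the "iterate over the content" idea from the comment above, here is a minimal sketch using `xml.dom.pulldom` from the standard library rather than SAX or `lxml`. That is a swapped-in technique, not something the commenter proposed. The sample XML, the `iter_items` helper and the `item` tag are made up for the example. It is written for a current Python; the module also exists in the 2.x standard library, but this is untested on 2.4.

```python
from xml.dom import pulldom

# Hypothetical sample data; the real datasource would be the 30MB file,
# opened with pulldom.parse(filename) instead of parseString().
SAMPLE = "<items><item id='1'>a</item><item id='2'>b</item><item id='3'>c</item></items>"


def iter_items(source, tag):
    """Yield one small DOM subtree per <tag> element, not the whole document."""
    events = pulldom.parseString(source)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == tag:
            # Build the DOM subtree for this element only.
            events.expandNode(node)
            yield node


# Only the extracted values are kept; each subtree can be freed after use.
ids = [item.getAttribute("id") for item in iter_items(SAMPLE, "item")]
print(ids)
```

Each yielded node is an ordinary minidom element, so existing DOM-style code for a single record should mostly carry over. Memory should stay low as long as the caller does not hold on to the nodes.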

score 1 · Answer 1 · edited Apr 18 '18 at 19:35

From the official doc:

xml.dom.minidom.parse(filename_or_file[, parser[, bufsize]])

You can specify a bufsize so that it only occupies X amount of memory at any given time.
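
A minimal sketch of passing `bufsize`, assuming an in-memory document in place of the real file (the sample data and the 4096-byte value are arbitrary). As far as I can tell, `bufsize` only sets how many bytes are read from the source per step. The complete tree is still built, so this may not reduce the final footprint much.

```python
import io
from xml.dom import minidom

# Hypothetical in-memory stand-in for the real datasource file.
data = b"<root>" + b"<row>value</row>" * 1000 + b"</root>"

# bufsize is passed by keyword; 4096 bytes is an arbitrary choice.
dom = minidom.parse(io.BytesIO(data), bufsize=4096)

# The whole tree is still materialised, whatever the read size was.
row_count = len(dom.getElementsByTagName("row"))
print(row_count)
```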

edited Apr 18 '18 at 19:35 by Billal Begueradj

answered Jan 09 '15 at 09:58 by Álvaro Gómez

The documentation says parser needs to be a SAX2 object. So, do I need to set up a SAX parser first and then pass it, along with the memory size, in the statement above? – Ram Jan 09 '15 at 10:17
Yes, you are right. Judging by other similar answers on this site, this library does not look well suited to your application. Have a look at some of the alternatives offered here: http://stackoverflow.com/questions/26787026/memory-leak-parsing-xml-using-xml-dom-minidom – Álvaro Gómez Jan 09 '15 at 10:24
Is there any way to find out how much memory self.dom is holding? self.dom = dom = minidom.parse(datasource) – Ram Jan 09 '15 at 10:51
You can use valgrind with Python, for example (http://stackoverflow.com/questions/20112989/how-to-use-valgrind-with-python), but there are lots of Python profilers. Just search for "python profiler" and you will find plenty; they should tell you what resources that function is consuming. – Álvaro Gómez Jan 09 '15 at 10:53
I used a memory profiler, memory_profiler, as you suggested, and it reports about 3GB for that particular statement: `3058.891 MiB dom1 = minidom.parse(datasource)`. I want to know whether the 3GB is due to parsing such a big file or due to storing the big tree that is generated. Any idea? :) Thanks for your help! – Ram Jan 09 '15 at 11:22
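
One possible way to separate "memory needed while parsing" from "memory held by the finished tree" is to compare peak and retained allocations around the parse call. The sketch below uses `tracemalloc`, which only exists in Python 3.4+, so it is not an option on Python 2.4 itself. The sample document is made up. `tracemalloc` also only sees allocations made through Python's allocator, so memory used internally by the C parser may not be counted.

```python
import gc
import io
import tracemalloc
from xml.dom import minidom

# Hypothetical stand-in document; substitute the real datasource.
data = b"<root>" + b"<row a='1'>value</row>" * 5000 + b"</root>"

gc.collect()
tracemalloc.start()
dom = minidom.parse(io.BytesIO(data))
# get_traced_memory() returns (current, peak) in bytes since start().
retained, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# retained: still allocated after parsing, mostly the tree itself.
# transient: extra memory that was only needed while parsing.
transient = peak - retained
print("tree: %.1f KiB, parse-only overhead: %.1f KiB"
      % (retained / 1024.0, transient / 1024.0))
```

If `retained` is close to `peak`, the tree itself is what costs the memory, and a different parse call that still builds the full DOM is unlikely to help.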
