A simplified version of my XML parsing function is here:

import xml.etree.cElementTree as ET

def analyze(xml):
    it = ET.iterparse(open(xml, 'rb'))
    count = 0

    for (ev, el) in it:
        count += 1

    print('count: {0}'.format(count))

This causes Python to run out of memory, which doesn't make a whole lot of sense. The only thing I am actually storing is the count, an integer. Why is it doing this:

[Image: graph of memory and CPU usage over time]

See that sudden drop in memory and CPU usage at the end? That's Python crashing spectacularly. At least it gives me a MemoryError (depending on what else I am doing in the loop, it gives me more random errors, like an IndexError) and a stack trace instead of a segfault. But why is it crashing?

NullUserException
Aillyn
  • http://stackoverflow.com/questions/1513592/python-is-there-an-xml-parser-implemented-as-a-generator/1513640#1513640 recommends calling `.clear()` on each element when you're done with it to save memory. Presumably this works because cElementTree keeps the previously-returned values in memory otherwise. – Wooble Oct 08 '11 at 15:19
  • @Wooble You should post that as an answer. Nailed it. – Aillyn Oct 08 '11 at 15:27
  • Also, I've had good success with `lxml`; it has identical (AFAIK) functionality, but is much more memory and time efficient. – user Oct 08 '11 at 18:27
  • @Oliver `lxml` beats `ElementTree`, but not `cElementTree` when it comes to parsing. – Aillyn Oct 08 '11 at 20:25
  • @Wooble: In all 3 ElementTree implementations, `iterparse()` builds the tree. It is up to the caller to delete unwanted elements. – John Machin Oct 08 '11 at 20:33
  • Just a note: this issue seems to not affect the memory on my Mac at all, but causes my Ubuntu server to hemorrhage RAM like it's going out of style. – Mike Davlantes Jun 12 '20 at 18:52

1 Answer

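`iterparse()` still builds the tree as it goes, so clear the root element once you are done with each element you care about. Code example: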

import xml.etree.cElementTree as etree

def getelements(filename_or_file, tag):
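    # Ask for 'start' events too, so the root element is available up front.
    context = iter(etree.iterparse(filename_or_file, events=('start', 'end')))
    _, root = next(context) # get root element
    for event, elem in context:
        # An element is only fully built once its 'end' event arrives.
        if event == 'end' and elem.tag == tag:
            yield elem
            # Runs after the caller has finished with elem: drop the
            # children accumulated under the root so they can be freed.
            root.clear()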
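
For reference, here is a rough sketch of the same idea applied to the counting loop from the question. It assumes Python 3, where `xml.etree.cElementTree` no longer exists (it was removed in 3.9) and plain `xml.etree.ElementTree` is used instead. The name `count_events` and the tiny in-memory sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9


def count_events(source):
    """Count 'end' events while discarding the tree built so far."""
    context = ET.iterparse(source, events=('start', 'end'))
    _, root = next(context)  # first event is the 'start' of the root element
    count = 0
    for event, elem in context:
        if event == 'end':
            count += 1
            root.clear()  # drop the children accumulated under the root
    return count


# Hypothetical in-memory document, used here only for illustration.
sample = io.BytesIO(b'<root><item>1</item><item>2</item><item>3</item></root>')
print('count: {0}'.format(count_events(sample)))  # count: 4
```

The count is 4 because the three `item` elements and the root each produce one `end` event, which matches what the default `iterparse()` events in the question would count.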
jfs
  • Shouldn't you invoke `clear()` on `elem` as well? Or are you certain that just clearing the root will cause the garbage collector to collect the element as well? – Henrik Heimbuerger Apr 04 '13 at 16:09
  • @hheimbuerger: `root.clear()` is enough. I haven't dug too deep, but the memory usage was small when I used it to parse large XML files. – jfs Apr 04 '13 at 21:52
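
A quick way to check the point discussed in these comments is to look at how many children the root still references after parsing, with and without `root.clear()`. This is only a sketch, assuming Python 3 and `xml.etree.ElementTree`. The helper `children_left_on_root` and the generated document are invented for the example, and it shows the references held by the root rather than measuring actual memory use.

```python
import io
import xml.etree.ElementTree as ET


def children_left_on_root(xml_bytes, clear_root):
    """Parse xml_bytes with iterparse and return how many children the
    root element still references once parsing has finished."""
    context = ET.iterparse(io.BytesIO(xml_bytes), events=('start', 'end'))
    _, root = next(context)  # 'start' event of the root element
    for event, elem in context:
        if event == 'end' and elem is not root and clear_root:
            root.clear()  # remove all children collected so far
    return len(root)


# Hypothetical document with 1000 child elements under the root.
doc = b'<root>' + b'<item>x</item>' * 1000 + b'</root>'
print(children_left_on_root(doc, clear_root=False))  # 1000
print(children_left_on_root(doc, clear_root=True))   # 0
```

Without the call, the root keeps every parsed child alive. With it, the root holds none of them, so an element can be freed as soon as nothing else refers to it.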