python cElementTree uses too much memory

Question

I have the following code segment

import xml.etree.cElementTree as et

fstring = open(filename).read()
tree = et.fromstring(fstring)

for el in tree.findall('tag'):
    pass  # do stuff with el

However, fstring is HUGE (~80 MB of data), and I am hitting an "Out of memory" error when I try to convert the string to a tree. Is there a way to get around that, perhaps some kind of lazy evaluation of the tree?

Thanks!

EDIT:

I tried using iterparse, and I still get a MemoryError on the iterparse call. Is there a way to split the file into multiple chunks and process them one by one?

Depending on what you want to do with the data, you could just write a SAX parser. They're extremely lightweight compared to DOM parsers. – l4mpi, Nov 06 '12 at 21:35
You can use [`iterparse`](http://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse). Also see: http://stackoverflow.com/a/7699801/396458 — NullUserException, Nov 06 '12 at 21:36

score 2 · Accepted Answer · answered Nov 06 '12 at 21:38

Take a look at iterparse:

For example, to parse large files, you can get rid of elements as soon as you’ve processed them:
from xml.etree.cElementTree import iterparse

for event, elem in iterparse(source):  # source is a filename or file object
    if elem.tag == "record":
        # ... process record elements ...
        elem.clear()
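
A slightly fuller sketch of the same pattern, in case it helps. It assumes Python 3, where `xml.etree.ElementTree` takes the place of `cElementTree`; the helper name `count_records`, the `record` tag, and the sample document are made up for illustration. The point to watch is that `iterparse` expects a filename or a file object, not the XML text itself.

```python
import io
import xml.etree.ElementTree as ET


def count_records(source, tag="record"):
    """Stream through `source` and count elements named `tag`.

    `source` is a filename or a binary file object, not an XML string.
    (Hypothetical helper, for illustration only.)
    """
    count = 0
    # Only "end" events are requested, so each element is complete when seen.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            count += 1    # stand-in for real per-record processing
            elem.clear()  # drop the element's children, text and attributes
    return count


# Usage with an in-memory document; for a real file, pass its path instead.
sample = b"<root><record>a</record><record>b</record><other/></root>"
print(count_records(io.BytesIO(sample)))
```

If the XML is already held in a string, wrapping its bytes in `io.BytesIO` as above is one way to hand it to `iterparse`, though for a large file passing the path avoids loading the whole text first.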


NPE


I am getting "MemoryError" on iterparse(source) line :\ – Jin Nov 06 '12 at 22:17
Are you clearing the elements you've visited, or are you letting them accumulate as you're iterating? – NPE Nov 06 '12 at 22:22
I am calling elem.clear() at end of the for-loop. Also, I am using xml.etree.cElementTree's iterparse, not lxml. I am not sure if that will make any difference. – Jin Nov 06 '12 at 22:25
The problem is that the call to iterparse itself raises MemoryError. The actual for-loop has not started executing yet. – Jin Nov 06 '12 at 22:25
I changed it to lxml.etree.iterparse, and I get MemoryError on the same line. The stack trace points to iterparse.__init__ and lxml.etree._encodeFilename functions. – Jin Nov 06 '12 at 22:28
Actually, I was using iterparse wrong. I was passing in the string instead of the filename. However, now my parsing isn't working. I'll try to debug it. Thanks! – Jin Nov 06 '12 at 23:00
`elem.clear()` might not be enough. You could call [`root.clear()`](http://stackoverflow.com/a/13261805/4279) – jfs Nov 07 '12 at 01:05
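
For reference, a rough sketch of the `root.clear()` idea from the comment above. It assumes Python 3's `xml.etree.ElementTree`; the function name `iter_records`, the `record` tag, and the sample data are hypothetical. Clearing an element empties it, but the root still holds a reference to each emptied child, so the root is grabbed from the first "start" event and cleared as well.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag="record"):
    """Yield the text of each `tag` element while keeping the tree small.

    (Hypothetical helper, for illustration only.)
    """
    context = ET.iterparse(source, events=("start", "end"))
    # The first event is the "start" of the document's root element.
    _, root = next(context)
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem.text
            # Clearing the root drops the processed children it still references.
            root.clear()


sample = b"<root><record>a</record><record>b</record></root>"
texts = list(iter_records(io.BytesIO(sample)))
print(texts)
```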
