I have the following code segment

import xml.etree.cElementTree as et

with open(filename) as f:  # filename: path to the ~80 MB XML file
    fstring = f.read()
tree = et.fromstring(fstring)

for el in tree.findall('tag'):
    pass  # do stuff with el

However, fstring is huge (~80 MB of data), and I am hitting an "Out of memory" error when I try to convert the string to a tree. Is there a way to get around that, perhaps some kind of lazy evaluation of the tree?

Thanks!

EDIT:

I tried using iterparse, and it still raises MemoryError on the iterparse call itself. Is there a way to split the file into multiple chunks and process them one by one?

Jin
  • Depending on what you want to do with the data, you could just write a SAX parser - they're extremely lightweight compared to DOM parsers. – l4mpi Nov 06 '12 at 21:35
  • You can use [`iterparse`](http://docs.python.org/2/library/xml.etree.elementtree.html#xml.etree.ElementTree.iterparse). Also see: http://stackoverflow.com/a/7699801/396458 – NullUserException Nov 06 '12 at 21:36
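
To illustrate the SAX suggestion above, here is a minimal sketch using the standard-library `xml.sax` module. It assumes the goal is to collect the text of each `tag` element; the `TagHandler` class and `collect_tag_texts` helper are hypothetical names introduced for this example, and the in-memory sample stands in for the real file.

```python
import io
import xml.sax


class TagHandler(xml.sax.ContentHandler):
    """Collects the text of every <tag> element without building a tree."""

    def __init__(self):
        super().__init__()
        self.in_tag = False
        self.buffer = []
        self.texts = []

    def startElement(self, name, attrs):
        if name == "tag":
            self.in_tag = True
            self.buffer = []

    def characters(self, content):
        # Text may arrive in several pieces, so accumulate it.
        if self.in_tag:
            self.buffer.append(content)

    def endElement(self, name):
        if name == "tag":
            self.texts.append("".join(self.buffer))
            self.in_tag = False


def collect_tag_texts(source):
    """source may be a file path or an open binary file object."""
    handler = TagHandler()
    xml.sax.parse(source, handler)
    return handler.texts


# Small in-memory stand-in for the large file.
sample = io.BytesIO(b"<root><tag>a</tag><other>x</other><tag>b</tag></root>")
print(collect_tag_texts(sample))
```

Because the handler only keeps what it explicitly stores, memory use depends on how much you collect rather than on the size of the document.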

1 Answer

Take a look at iterparse:

For example, to parse large files, you can get rid of elements as soon as you’ve processed them:

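from xml.etree.ElementTree import iterparse

for event, elem in iterparse(source):  # source: a filename or file object
    if elem.tag == "record":
        # ... process record elements ...
        elem.clear()  # free the element's children, text, and attributes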
NPE
  • I am getting "MemoryError" on iterparse(source) line :\ – Jin Nov 06 '12 at 22:17
  • Are you clearing the elements you've visited, or are you letting them accumulate as you're iterating? – NPE Nov 06 '12 at 22:22
  • I am calling elem.clear() at end of the for-loop. Also, I am using xml.etree.cElementTree's iterparse, not lxml. I am not sure if that will make any difference. – Jin Nov 06 '12 at 22:25
  • The problem is that the call to iterparse itself is raising MemoryError. The actual for-loop has not started executing yet. – Jin Nov 06 '12 at 22:25
  • I changed it to lxml.etree.iterparse, and I get MemoryError on the same line. The stack trace points to iterparse.__init__ and lxml.etree._encodeFilename functions. – Jin Nov 06 '12 at 22:28
  • Actually, I was using iterparse wrong. I was passing in the string instead of the filename. However, now my parsing isn't working. I'll try to debug it. Thanks! – Jin Nov 06 '12 at 23:00
  • `elem.clear()` might not be enough. You could call [`root.clear()`](http://stackoverflow.com/a/13261805/4279) – jfs Nov 07 '12 at 01:05
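
Putting the points from this thread together, here is a rough sketch of the pattern: pass `iterparse` a file path or file object rather than the XML string, and clear the root as you go. It assumes the `record` elements are direct children of the root and that the root's own attributes are not needed (`root.clear()` drops those too); `iter_records` is a hypothetical helper name, and the in-memory sample stands in for the real file.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, tag="record"):
    """Yield each <tag> element, then release what has been parsed so far.

    source must be a file path or a file object, not the XML text itself.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Runs once the caller asks for the next record: detach the
            # processed children that would otherwise pile up under the root.
            root.clear()


# Small in-memory stand-in for the large file.
sample = io.BytesIO(b"<data><record id='1'/><record id='2'/></data>")
ids = [rec.get("id") for rec in iter_records(sample)]
print(ids)
```

Each yielded element should be fully processed before the next one is requested, since the root is cleared in between.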