
I'm trying to parse an XML file with iterparse. The first iterparse loop works correctly, but the second one starts to fill up memory. Removing the first iterparse loop changes nothing. The XML is valid.

from lxml import etree

def clear_element(e):
    # Free the element's content, then delete the already-processed
    # preceding siblings from the parent.
    e.clear()
    while e.getprevious() is not None:
        del e.getparent()[0]

def import_xml(request):
    f = 'file.xml'
    offers = etree.iterparse(f, events=('end',), tag='offer')
    for event, offer in offers:
        # processing
        # works correctly
        clear_element(offer)

    categories = etree.iterparse(f, events=('end',), tag='category')
    for event, category in categories:
        # using memory
        clear_element(category)

XML:

<shop>
    <categories>
        <category>name</category>
        <category>name</category>
        <category>name</category>
        <!-- ~ 1000 categories -->
    </categories>
    <offers>
        <offer>
           <inner_tag>data</inner_tag>
           <inner_tag>data</inner_tag>
        </offer>
        <offer>
           <inner_tag>data</inner_tag>
           <inner_tag>data</inner_tag>
        </offer>
        <!-- ~ 450000 offers -->
    </offers>
</shop>
Shihal
  • related: [Why is lxml.etree.iterparse() eating up all my memory?](http://stackoverflow.com/q/12160418/4279) – jfs Oct 18 '14 at 03:13

2 Answers


You're parsing the file twice. In the first pass you drop the offer elements and keep all the category elements, which for about 1000 categories doesn't take much memory.

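In the second pass, however, you only drop the category elements and keep all 450000 offer elements. The tag argument only filters which events are reported to your loop; the complete tree is still built in the background, and that is why the second pass requires a lot of memory.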
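The effect can be reproduced on a tiny document. The following is only a sketch: it uses the standard library's xml.etree.ElementTree so that it has no third-party dependency, and the check on elem.tag stands in for lxml's tag argument. The sample data and the name count_retained_offers are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical miniature of the file from the question.
SAMPLE = b"""<shop>
  <categories>
    <category>a</category>
    <category>b</category>
  </categories>
  <offers>
    <offer><inner_tag>1</inner_tag></offer>
    <offer><inner_tag>2</inner_tag></offer>
    <offer><inner_tag>3</inner_tag></offer>
  </offers>
</shop>"""

def count_retained_offers(source):
    """Clear only <category> elements and report how many <offer>
    elements are still attached to the tree afterwards."""
    root = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if root is None:
            root = elem  # the first "start" event belongs to the root
        if event == "end" and elem.tag == "category":
            elem.clear()  # frees the category, nothing else
    return len(root.findall("./offers/offer"))

retained = count_retained_offers(io.BytesIO(SAMPLE))
print(retained)
```

Every offer is still in the tree after the loop, even though the loop never did anything with them. With 450000 offers, that retained tree is what uses the memory.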

In such a case it's better not to pass the tag argument to iterparse, but to check the tag name yourself and drop every element you no longer need:

from lxml import etree

def import_xml(request):
    f = 'file.xml'
    elements = etree.iterparse(f, events=('end',))
    for event, element in elements:
        if element.tag == 'offer':
            pass  # handle offer ...
        elif element.tag == 'category':
            pass  # handle category ...
        else:
            continue
        # Free the processed element and detach it from the tree.
        element.clear()
        element.getparent().remove(element)

Note: calling element.clear() alone, without removing the element from its parent, still leaves the cleared (now empty) elements in memory as part of the constructed tree. Once the element is removed from its parent, the clear() call is probably not really needed.

mata
  • `element.getparent().remove(element)` might not be enough. You might need to remove siblings. See [the related question](http://stackoverflow.com/q/12160418/4279). – jfs Oct 18 '14 at 03:16
  • `AttributeError: 'xml.etree.ElementTree.Element' object has no attribute 'getparent'` – Jack M May 25 '23 at 20:28
  • @jfs not relevant here, the question was about `lxml.etree`, not `xml.etree` – mata Jun 26 '23 at 18:38
  • @mata: the title of the question I've linked: "why is lxml.." – jfs Jul 01 '23 at 05:17
  • @jfs Ah, sorry, my fault, picked the wrong user to reply to - that was actually meant to address the other comment from above – mata Jul 06 '23 at 17:28

I was fighting with iterparse for a while as well and now finally think I know how to use it correctly, so here are my words of wisdom on this: When using iterparse:

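  1. Make sure to use the cElementTree implementation (on Python 3.3 and later, xml.etree.ElementTree already uses the C implementation automatically, and the cElementTree module was removed in Python 3.9)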

  2. Make sure to clear any elements that you do not need along the way. This is particularly important if you have a very complex XML with deeply nested structures.

So let's assume your XML had additional nodes like this:

<offers>
    <offer>
        <inner_tag>data</inner_tag>
        <i2>
            <i3>1000 characters of something</i3>
        </i2>
        <inner_tag>data</inner_tag>
    </offer>
</offers>

then your code should look like this:

import xml.etree.ElementTree as etree

def import_xml(request):
    f = 'file.xml'
    elements = etree.iterparse(f, events=('end',))
    for event, element in elements:
        if element.tag == 'offer':
            pass  # handle offer ...
        elif element.tag == 'category':
            pass  # handle category ...
        elif element.tag != 'i2':
            continue
        # Reached for offer, category and i2: free their contents.
        element.clear()

This way, the complete <i2> nodes are cleared together with their contents as soon as they have been parsed, while you can still process any other elements within <offers>.

element.getparent().remove(element) does not work in my code (AttributeError), because getparent() exists only on lxml elements and not on xml.etree.ElementTree elements.
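With xml.etree.ElementTree, one possible way to detach processed elements anyway is to keep track of the parents yourself. This is a minimal sketch, not something from the answers above: it listens to both start and end events and keeps a stack of the elements that are still open, so the parent of a finished element is simply the top of the stack. The name stream_elements and the sample data are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET

def stream_elements(source, wanted_tags):
    """Yield each finished element whose tag is in wanted_tags,
    then remove it from its parent so the tree does not grow."""
    open_elements = []  # elements whose end tag has not been seen yet
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            open_elements.append(elem)
            continue
        open_elements.pop()  # elem itself was on top of the stack
        if elem.tag in wanted_tags:
            yield elem  # the caller must finish with elem before continuing
            if open_elements:
                # The new top of the stack is elem's parent.
                open_elements[-1].remove(elem)

# Hypothetical miniature of the file from the question.
SAMPLE = (
    b"<shop><categories><category>a</category><category>b</category>"
    b"</categories><offers><offer><inner_tag>x</inner_tag></offer>"
    b"</offers></shop>"
)

seen = [
    (elem.tag, len(elem))
    for elem in stream_elements(io.BytesIO(SAMPLE), {"offer", "category"})
]
print(seen)
```

Elements that are not in wanted_tags, such as inner_tag here, stay attached until their enclosing offer is removed, so a yielded offer still has its children when the caller handles it.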

trincot
abulhol
  • Hi, thank you for answering. However, the code you provide does not look any different from the code in the answer that was accepted over a year ago, so it looks like you copied it. Also, the two points you raise were already applied by the Asker. Could you explain what you think is the additional value of this answer? – trincot Nov 18 '15 at 08:46
  • Hi trincot, thanks for asking. In all the examples on iterparse I read at Stackoverflow and elsewhere, the XML structures were always super simple, while the files were reported to be gigantic. I was dealing with the opposite: files < 20 MB but with deep nesting. These also filled up the RAM pretty badly, because I was not clearing away, along the way, all the subelements that I was not looking for. I re-used the code from the first answer because I think it is helpful. But I can see that it is maybe a bit confusing. ;-) – abulhol Nov 18 '15 at 14:15