I have a large XML file (about 600 MB) that I am trying to parse using cElementTree with iterparse. This is my first time attempting this.

I am iterating on 'product' tags and elem.clear()-ing after I process each product. Within my parsing I have a function parse_trips which uses a for loop to parse <trip> tags within <trips> tags (each product could potentially have hundreds of these which are each hundreds of lines long).

for trip in trips:
    dump(trip)
    get_date(trip, product)
    set_price(trip, product)

However, when I dump(trips) I see that these tags are getting truncated or closed out early without any error being thrown. The parser seems to reach a maximum size for the elem in memory and then just won't hold any more.

The raw xml:

<trip>
    <code>text</code>
    <name>text</name>
    <image>img.jpg</image>
    <date>2014-08-10</date>
    <pricing>

    </pricing>
    <itinerary>
        <code>1</code>
        <events>
            <event>
                eventInfo 1
            </event>
            <event>
                eventInfo 2
            </event>
            <event>
                eventInfo 3
            </event>
            <event>
                eventInfo 4
            </event>
            <event>
                eventInfo 5
            </event>
            <event>
                eventInfo 6
            </event>
            <event>
                eventInfo 7
            </event>
            <event>
                eventInfo 8
            </event>
        </events>
    </itinerary>
</trip>

Although there might be 6 such trips in a group, by the time I reach the second trip in the group, the output of dump(trip) looks like this:

<trip>
    <code>text</code>
    <name>text</name>
    <image>img.jpg</image>
    <date>2014-08-10</date>
    <pricing></pricing>
    <itinerary>
        <code>1</code>
        <events>
            <event>
                eventInfo 1
            </event>
            <event>
                eventInfo 2
            </event>
            <event>
                eventInfo 3
            </event>
        </events>            
    </itinerary>
</trip>

and every later trip is gone. I tried looping through and just incrementing an integer i to count how many <trip> tags there are. The loop only reaches the second one, which is truncated, and then ends.
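
One possible explanation for this symptom, which nothing in the question confirms, is that the product is being processed on a 'start' event. The ElementTree documentation notes that at a 'start' event an element's children may or may not be present yet, and that a fully populated element is only guaranteed at the 'end' event. The sketch below uses `xml.etree.ElementTree` (on Python 2, `xml.etree.cElementTree` has the same interface) and made-up data to compare how many `<trip>` children are visible at each event. The function name and the test document are hypothetical.

```python
import io
import xml.etree.ElementTree as ET  # on Python 2: xml.etree.cElementTree


def trip_counts_by_event(source):
    """Return (children seen at 'start', children seen at 'end') for <trips>."""
    at_start = at_end = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if elem.tag != "trips":
            continue
        if event == "start":
            # Only the children parsed so far are attached; may be incomplete.
            at_start = len(elem)
        else:
            # The closing tag has been seen, so every child is attached.
            at_end = len(elem)
    return at_start, at_end


# Hypothetical test data: one product holding 5000 small trips.
TRIP = b"<trip><code>text</code><date>2014-08-10</date></trip>"
DOC = (b"<products><product><trips>" + TRIP * 5000 +
       b"</trips></product></products>")

at_start, at_end = trip_counts_by_event(io.BytesIO(DOC))
print("children at start:", at_start, "children at end:", at_end)
```

If the count at 'start' is lower than the count at 'end', the truncation would come from reading the element too early rather than from a size limit.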

Is there a way to view/configure the size of the element iterparse can grab? Or a way to use iter again once I get to trips to grab ALL child nodes of <trips>?
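
A minimal sketch of one way to reach every `<trip>` under a product, assuming the product is only handled once its 'end' event fires and is cleared afterwards. `Element.iter()` walks all descendants with the given tag. The function name, the `<products>` root, and the sample data are assumptions made for illustration, not taken from the real file.

```python
import io
import xml.etree.ElementTree as ET  # on Python 2: xml.etree.cElementTree


def summarize_products(source):
    """For each product, list (date, number of events) for every trip."""
    summaries = []
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "product":
            continue
        # At the 'end' event the whole product subtree has been parsed.
        trips = [
            (trip.findtext("date"),
             len(trip.findall("./itinerary/events/event")))
            for trip in elem.iter("trip")
        ]
        summaries.append(trips)
        elem.clear()  # release the subtree only after it has been processed
    return summaries


# Hypothetical sample shaped like the XML in the question.
SAMPLE = b"""<products>
  <product>
    <trips>
      <trip><date>2014-08-10</date><itinerary><events>
        <event>1</event><event>2</event>
      </events></itinerary></trip>
      <trip><date>2014-08-17</date><itinerary><events>
        <event>1</event>
      </events></itinerary></trip>
    </trips>
  </product>
</products>"""

summaries = summarize_products(io.BytesIO(SAMPLE))
print(summaries)
```

Clearing only the product element, and only after all of its trips have been read, keeps memory bounded per product without discarding trips that have not been processed yet.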

alsoALion
  • The docs say [`dump()` should be used "for debugging only"](https://docs.python.org/2/library/xml.etree.elementtree.html). Have you tried using `tostring` instead? It might be a weirdness in the implementation. – Patrick Collins Aug 27 '14 at 04:17
  • I am using dump() for debugging, but even when I remove it, it's clearly still truncating because the later elements don't get parsed. – alsoALion Aug 27 '14 at 04:24
  • Try to reproduce with lxml to find out whether it's a cElementTree issue. – wouter bolsterlee May 03 '15 at 13:16
  • I would try to use [BeautifulSoup](https://pypi.python.org/pypi/beautifulsoup4) instead, which can use lxml or html5lib as its underlying parser – rll Jul 31 '15 at 14:14

0 Answers