I have a large XML file (about 600 MB) that I am trying to parse using cElementTree with iterparse. This is my first time attempting this. I am iterating on 'product' tags and calling elem.clear() after I process each product. Within my parsing I have a function parse_trips which uses a for loop to parse the <trip> tags within <trips> tags (each product could potentially have hundreds of these, each of which is hundreds of lines long).
for trip in trips:
    dump(trip)
    get_date(trip, product)
    set_price(trip, product)
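A minimal sketch of the setup described above, for reference. It is an illustration under stated assumptions, not the real code: the inline SAMPLE document, the <products> root tag, and the body of parse_trips (which here only collects <date> text) are hypothetical stand-ins. It imports xml.etree.ElementTree because cElementTree was removed in Python 3.9.

```python
import io
import xml.etree.ElementTree as ET  # cElementTree was removed in Python 3.9

# Hypothetical stand-in for the 600 MB file; the real layout may differ.
SAMPLE = b"""<products>
  <product>
    <trips>
      <trip><code>a</code><date>2014-08-10</date></trip>
      <trip><code>b</code><date>2014-08-11</date></trip>
    </trips>
  </product>
  <product>
    <trips>
      <trip><code>c</code><date>2014-09-01</date></trip>
    </trips>
  </product>
</products>"""


def parse_trips(product):
    """Collect the <date> text of every <trip> under this product (illustrative only)."""
    dates = []
    for trip in product.iter("trip"):
        dates.append(trip.findtext("date"))
    return dates


def parse_products(source):
    """Yield one result per <product>, clearing each product after it is processed."""
    # Only "end" events are requested here, so a product's closing tag
    # has already been seen by the time the element is handled.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "product":
            yield parse_trips(elem)
            elem.clear()  # free the subtree that was just processed


results = list(parse_products(io.BytesIO(SAMPLE)))
```

In this sketch the trips are read only when the "end" event for the enclosing product arrives; whether the real code handles elements on "start" or on "end" is not stated above.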
However, when I dump(trips) I see that these tags are getting truncated/closed out early without any error being thrown. The parser seems to reach a maximum length for the elem in memory and then just won't hold any more.
The raw XML:
<trip>
<code>text</code>
<name>text</name>
<image>img.jpg</image>
<date>2014-08-10</date>
<pricing>
</pricing>
<itinerary>
<code>1</code>
<events>
<event>
eventInfo 1
</event>
<event>
eventInfo 2
</event>
<event>
eventInfo 3
</event>
<event>
eventInfo 4
</event>
<event>
eventInfo 5
</event>
<event>
eventInfo 6
</event>
<event>
eventInfo 7
</event>
<event>
eventInfo 8
</event>
</events>
</itinerary>
</trip>
While there might be 6 such <trip> blocks in a group, when I reach the second trip in the group, the output of dump(trip) looks like this:
<trip>
<code>text</code>
<name>text</name>
<image>img.jpg</image>
<date>2014-08-10</date>
<pricing></pricing>
<itinerary>
<code>1</code>
<events>
<event>
eventInfo 1
</event>
<event>
eventInfo 2
</event>
<event>
eventInfo 3
</event>
</events>
</itinerary>
</trip>
After that, every later trip is gone. I tried looping through and just incrementing an integer i to count how many <trip> tags there are, and it only reaches the second one, which it truncates, and then the for loop ends.
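The counting check can be sketched as follows. This is only an illustration of the check itself: the three-trip <trips> fragment is a hypothetical sample parsed in one go with fromstring, not an excerpt from the real file or from the iterparse loop.

```python
import xml.etree.ElementTree as ET

# Hypothetical <trips> fragment containing three <trip> children.
trips = ET.fromstring(
    "<trips>"
    "<trip><code>1</code></trip>"
    "<trip><code>2</code></trip>"
    "<trip><code>3</code></trip>"
    "</trips>"
)

# Count the direct <trip> children by incrementing an integer, as described.
i = 0
for trip in trips:
    i += 1

# Cross-check against findall, which also returns only direct children.
expected = len(trips.findall("trip"))
```

On a fully parsed element the two counts agree, so a mismatch inside the iterparse loop would suggest the element was inspected before it had been read completely.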
Is there a way to view/configure the size of the element iterparse can grab? Or a way to use iter again once I get to trips to grab ALL child nodes of <trips>?