How to split an XML file the simple way in Python?

Question

I have Python code for parsing an XML file as detailed here. I understand that XML files are notorious for hogging system resources when manipulated in memory. My solution works for smaller XML files (say 200KB), but the file I need to process is 340MB.

I started researching StAX (pull parser) implementations, but I am on a tight schedule and am looking for a much simpler approach for this task.

I understand how to create smaller chunk files, but how do I extract the right elements and output the main/header tags in every file?

For instance, this is the structure:

<?xml version="1.0" ?>
<!--Sample XML Document-->
<bookstore>
    <book Id="1">
      ....
      ....
    </book> 
    <book Id="2">
      ....
      ....
    </book> 
    <book Id="3">
      ....
      ....
    </book> 
    ....
    ....
    ....
    <book Id="n">
      ....
      ....
    </book> 
</bookstore>

How do I create new XML files with header data for every 1000 book elements? For a concrete example of the code and data set, please refer to my other question here. Thanks a lot.

All I want to do is avoid in-memory loading of the dataset all at once. Can we parse the XML file in a streaming fashion? Am I thinking along the right lines?

P.S.: My situation is similar to a question asked in 2009. I will post an answer here once I find a simpler solution for my problem. Your feedback is appreciated.

score 8 · Accepted Answer · answered Sep 07 '11 at 17:08

You can parse your big XML file incrementally:

# xml.etree.cElementTree was removed in Python 3.9; since Python 3.3,
# xml.etree.ElementTree uses the C accelerator automatically
from xml.etree.ElementTree import iterparse

# get an iterable and turn it into an iterator
context = iter(iterparse("path/to/big.xml", events=("start", "end")))

# get the root element
event, root = next(context)
assert event == "start"

for event, elem in context:
    if event == "end" and elem.tag == "book":
        # ... process book elements ...
        root.clear()  # free the children parsed so far
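
To connect this to the question (one output file per 1000 book elements, each with the root tag repeated), here is a minimal sketch built on the same iterparse loop. The function name `split_xml`, the `out_dir` parameter, and the `part_0001.xml` naming scheme are illustrative assumptions, not part of the original answer. It also assumes that `book` elements appear only as direct children of the root.

```python
import os
from xml.etree.ElementTree import Element, ElementTree, iterparse


def split_xml(src_path, out_dir, item_tag="book", chunk_size=1000):
    """Write every `chunk_size` <item_tag> elements to a separate XML file.

    Hypothetical helper: each output file repeats the root tag and its
    attributes, so it is a well-formed document on its own.
    Returns the list of file paths written.
    """
    os.makedirs(out_dir, exist_ok=True)
    written = []

    def flush(root_tag, root_attrib, items):
        # Wrap the collected items in a copy of the original root element.
        wrapper = Element(root_tag, root_attrib)
        wrapper.extend(items)
        path = os.path.join(out_dir, "part_%04d.xml" % (len(written) + 1))
        ElementTree(wrapper).write(path, encoding="utf-8", xml_declaration=True)
        written.append(path)

    context = iter(iterparse(src_path, events=("start", "end")))
    _, root = next(context)  # the first event is the "start" of the root
    # Copy the root's tag and attributes now, because root.clear() resets them.
    root_tag, root_attrib = root.tag, dict(root.attrib)

    items = []
    for event, elem in context:
        if event == "end" and elem.tag == item_tag:
            items.append(elem)
            if len(items) == chunk_size:
                flush(root_tag, root_attrib, items)
                items = []
                root.clear()  # drop the already-written children from memory
    if items:
        flush(root_tag, root_attrib, items)  # last, possibly shorter, chunk
    return written
```

Calling `split_xml("path/to/big.xml", "chunks")` would then produce `chunks/part_0001.xml`, `chunks/part_0002.xml`, and so on. At most `chunk_size` book elements are held in memory at a time, which is the point of the incremental approach.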

score 1 · Answer 2 · answered Sep 07 '11 at 16:59

You can use xml.etree.ElementTree.iterparse and discard each book element after it has been processed.
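
One way to package the "process, then discard" idea is a small generator. This is only a sketch: the name `iter_elements` is a hypothetical helper, and it assumes the elements of interest are direct children of the root.

```python
from xml.etree.ElementTree import iterparse


def iter_elements(source, tag="book"):
    """Yield each completed <tag> element, then discard it (hypothetical helper)."""
    context = iter(iterparse(source, events=("start", "end")))
    _, root = next(context)  # the first event is the "start" of the root
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            yield elem
            # Runs when the caller asks for the next element: detach
            # everything parsed so far from the root.
            root.clear()
```

A caller would loop with `for book in iter_elements("path/to/big.xml"):` and extract what it needs from each element. Keeping references to yielded elements keeps them in memory, so it defeats the purpose to collect them all in a list.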

Mihai Stan

Better still, use lxml's etree (http://lxml.de/tutorial.html) for a performance boost. – six8 Sep 07 '11 at 17:02

@Cixate: it is unclear (without a benchmark) whether `cElementTree.iterparse()` is slower than `lxml.etree.iterparse()` when only parsing is required http://www.ibm.com/developerworks/xml/library/x-hiperfparse/ – jfs Sep 07 '11 at 17:21
