ElementTree iterparse strategy

Question

I have to handle xml documents that are big enough (up to 1GB) and parse them with python. I am using the iterparse() function (SAX style parsing).

My concern is the following, imagine you have an xml like this

<?xml version="1.0" encoding="UTF-8" ?>
<families>
  <family>
    <name>Simpson</name>
    <members>
        <name>Homer</name>
        <name>Marge</name>
        <name>Bart</name>
    </members>
  </family>
  <family>
    <name>Griffin</name>
    <members>
        <name>Peter</name>
        <name>Brian</name>
        <name>Meg</name>
    </members>
  </family>
</families>

The problem is, of course to know when I am getting a family name (as Simpsons) and when I am getting the name of one of that family member (for example Homer)

What I have been doing so far is to use "switches" which will tell me if I am inside a "members" tag or not, the code will look like this

import xml.etree.cElementTree as ET

__author__ = 'moriano'

file_path = "test.xml"
context = ET.iterparse(file_path, events=("start", "end"))

# turn it into an iterator
context = iter(context)
on_members_tag = False
for event, elem in context:
    tag = elem.tag
    value = elem.text
    if value :
        value = value.encode('utf-8').strip()

    if event == 'start' :
        if tag == "members" :
            on_members_tag = True

        elif tag == 'name' :
            if on_members_tag :
                print "The member of the family is %s" % value
            else :
                print "The family is %s " % value

    if event == 'end' and tag =='members' :
        on_members_tag = False
    elem.clear()

And this works fine as the output is

The family is Simpson 
The member of the family is Homer
The member of the family is Marge
The member of the family is Bart
The family is Griffin 
The member of the family is Peter
The member of the family is Brian
The member of the family is Meg

My concern is that with this (simple) example i had to create an extra variable to know in which tag i was (on_members_tag) imagine with the true xml examples that I have to handle, they have more nested tags.

Also note that this is a very reduced example, so you can assume that i may be facing an xml with more tags, more inner tags and trying to get different tag names, attributes and so on.

So question is. Am I doing something horribly stupid here? I feel like there must be a more elegant solution to this.

What will you do with the data? Construct a Python data structure to hold it all, or store in db while iterating, or something else? — Janne Karila, Oct 09 '12 at 05:25
@JanneKarila : The data could be put on python structure, save to db or dump into a file, it would depend on the proccess, in this case you can assume that it will be written to db — Juan Antonio Gomez Moriano, Oct 09 '12 at 05:33

nneonneo · Accepted Answer · 2012-10-09T06:31:37.253

35

Here's one possible approach: we maintain a path list and peek backwards to find the parent node(s).

path = []
for event, elem in ET.iterparse(file_path, events=("start", "end")):
    if event == 'start':
        path.append(elem.tag)
    elif event == 'end':
        # process the tag
        if elem.tag == 'name':
            if 'members' in path:
                print 'member'
            else:
                print 'nonmember'
        path.pop()

edited Oct 09 '12 at 06:31

answered Oct 09 '12 at 06:24

nneonneo

171,345
36
312
383

Simple, elegant and does the job. Thanks a lot :) – Juan Antonio Gomez Moriano Oct 09 '12 at 22:02
Is there any standard name for this approach? I believe this approach is used for many such problems. If you could tell the name of it, I can dig deeper and understand this. – Sadanand Upase Oct 25 '17 at 03:18
While I like this neat solution, I think it worth noting that it may be less performant than the original approach with boolean flag, because search in the list is introduced. Even though for usual documents max depth will be quite reasonable (10-20 levels?), it *may* get bad for very nested trees – The Godfather Oct 06 '19 at 23:40
1

@TheGodfather good point. If performance is a concern, you can replace the `path` list with a `collections.Counter()` object, then do `path[elem.tag] += 1` and `path[elem.tag] -= 1` instead of `append` and `pop`. This has O(1) amortized lookup, and the counter size is proportional to the number of *unique* tag names which makes it space-efficient. – nneonneo Oct 09 '19 at 13:28

score 17 · Answer 2 · answered Oct 09 '12 at 07:24

17

pulldom is excellent for this. You get a sax stream. You can iterate through the stream, and when you find a node that your are interested in, load that node in to a dom fragment.

import xml.dom.pulldom as pulldom
import xpath # from http://code.google.com/p/py-dom-xpath/

events = pulldom.parse('families.xml')
for event, node in events:
    if event == 'START_ELEMENT' and node.tagName=='family':
        events.expandNode(node) # node now contains a dom fragment
        family_name = xpath.findvalue('name', node)
        members = xpath.findvalues('members/name', node)
        print('family name: {0}, members: {1}'.format(family_name, members))

output:

family name: Simpson, members: [u'Hommer', u'Marge', u'Bart']
family name: Griffin, members: [u'Peter', u'Brian', u'Meg']

answered Oct 09 '12 at 07:24

Gary van der Merwe

9,134
3
49
80

This is a very nice solution, however I cannot give you it as an accepted answer (I like nneonneo's answer better), however, it definately looks like an elegant solution. Thanks! – Juan Antonio Gomez Moriano Oct 09 '12 at 22:01
Great answer. Very simple to use. allowed to parse a 46 GB xml file – Hani Goc Oct 15 '15 at 10:58
1

The unfortunate problem with this is that it is so exceedingly slow. Quickly comparing cElementTree.parse and pulldom.parse methods gave me 1 minute for cElementTree and 10 minutes for pulldom.parse. A 10X increase in time is insane for large files. – chase Aug 20 '19 at 19:42

ElementTree iterparse strategy

2 Answers2

Linked