Getting subelements using lxml and iterparse

Question

I am trying to write a parsing algorithm to efficiently pull data from an XML document. I currently walk the document element by element and child by child, but would like to use iterparse instead. I have a list of element names, and whenever one of them is found I want to pull the data from its children. With iterparse, however, my only options seem to be filtering on a single element name or receiving every single element.

Example xml:

<?xml version="1.0" encoding="UTF-8"?>
<data_object xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
 <source id="0">
  <name>Office Issues</name>
  <datetime>2012-01-13T16:09:15</datetime>
  <data_id>7</data_id>
 </source>
 <event id="125">
  <date>2012-11-06</date>
  <state_id>7</state_id>
 </event>
 <state id="7">
  <name>Washington</name>
 </state>
 <locality id="2">
  <name>Olympia</name>
  <state_id>7</state_id>
  <type>City</type>
 </locality>
 <locality id="3">
  <name>Town</name>
  <state_id>7</state_id>
  <type>Town</type>
 </locality>
</data_object>

Code example:

from lxml import etree

fname = "test.xml"
ELEMENT_LIST = ["source", "event", "state", "locality"]

# Binary mode: lxml expects file objects to return bytes.
with open(fname, "rb") as xml_doc:
    context = etree.iterparse(xml_doc, events=("start", "end"))

    context = iter(context)

    # The first event is the "start" of the root element.
    event, root = next(context)

    base = False
    bname = ""

    for event, elem in context:
        if event == "start" and elem.tag in ELEMENT_LIST:
            base = True
            bname = elem.tag
            child_list = [child.tag for child in elem]
            print(bname + ":" + str(child_list))
        elif event == "end" and elem.tag in ELEMENT_LIST:
            base = False
            root.clear()

The xml you attached is invalid. – uhz May 07 '12 at 19:20
Thanks, fixed, did not have the "event" closing tag correct – Sam Johnson May 07 '12 at 19:24

score 1 · Answer 1 · answered May 07 '12 at 19:28

With iterparse you cannot restrict parsing to several kinds of tags; you can only filter on one tag (by passing the tag argument). However, it is easy to do the filtering yourself. In the following snippet:

from lxml import etree

fname = "test.xml"
ELEMENT_LIST = ["source", "event", "state", "locality"]

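# Binary mode: lxml expects file objects to return bytes.
with open(fname, "rb") as xml_doc:
    # Listen for "end" events only: an element's children are guaranteed
    # to be complete only once its closing tag has been parsed.
    context = etree.iterparse(xml_doc, events=("end",))

    for event, elem in context:
        if elem.tag in ELEMENT_LIST:
            print("this elem is interesting, do some processing: %s: [%s]" % (elem.tag, ", ".join(child.tag for child in elem)))
            # Free the subtree once it has been processed.
            elem.clear()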

you limit your search to the interesting tags only. An important part of using iterparse is calling elem.clear(), which frees memory once an element is no longer needed. That is what makes it memory efficient; see http://lxml.de/parsing.html#modifying-the-tree
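A minimal sketch of the same idea that also collects the child data, not only the child tags. It assumes a trimmed copy of the question's XML held in memory (SAMPLE), a hypothetical extra element named ignored to show the filter at work, and a hypothetical helper collect_records. The fallback import to the standard library's ElementTree is also an assumption, added only so the sketch runs without lxml installed.

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: stdlib fallback; its iterparse offers the same basic interface.
    import xml.etree.ElementTree as etree

# Trimmed copy of the question's document; <ignored> is hypothetical.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<data_object>
 <source id="0">
  <name>Office Issues</name>
  <data_id>7</data_id>
 </source>
 <state id="7">
  <name>Washington</name>
 </state>
 <locality id="2">
  <name>Olympia</name>
  <state_id>7</state_id>
  <type>City</type>
 </locality>
 <ignored id="9">
  <name>skip me</name>
 </ignored>
</data_object>
"""

ELEMENT_SET = {"source", "event", "state", "locality"}


def collect_records(xml_source, wanted=ELEMENT_SET):
    """Return (tag, id, {child_tag: child_text}) for each wanted element."""
    records = []
    # "end" fires after the closing tag, so the children are complete.
    for event, elem in etree.iterparse(xml_source, events=("end",)):
        if elem.tag in wanted:
            children = {child.tag: child.text for child in elem}
            records.append((elem.tag, elem.get("id"), children))
            # Clear only after the data has been copied out.
            elem.clear()
    return records


records = collect_records(io.BytesIO(SAMPLE))
for record in records:
    print(record)
```

Clearing only the matched elements, rather than every element, keeps the text of the children intact until their parent has been processed.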

score 0 · Answer 2 · answered May 07 '12 at 20:32

I would use XPath instead. It's much more elegant than walking the document on your own and, I assume, more efficient.
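A rough sketch of what the XPath route could look like, assuming the document is small enough to load completely. The SAMPLE data is a trimmed copy of the question's XML, and the union expression is just one possible way to write the query. Unlike iterparse, etree.fromstring builds the whole tree in memory first.

```python
from lxml import etree

# Trimmed copy of the question's document.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<data_object>
 <source id="0">
  <name>Office Issues</name>
  <data_id>7</data_id>
 </source>
 <state id="7">
  <name>Washington</name>
 </state>
 <locality id="2">
  <name>Olympia</name>
  <state_id>7</state_id>
  <type>City</type>
 </locality>
</data_object>
"""

# Parse the entire document into a tree (no streaming here).
root = etree.fromstring(SAMPLE)

# Union expression: children of the root with any of the four tag names.
matches = root.xpath("source | event | state | locality")

# Pair each matched element's tag with the tags of its children.
summary = [(elem.tag, [child.tag for child in elem]) for elem in matches]
for tag, child_tags in summary:
    print(tag + ":" + str(child_tags))
```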

But that is not a fault with XPath per se. Also, there are ways of mitigating memory bloat: http://stackoverflow.com/a/4696161/395582 – XORcist Oct 09 '12 at 16:21

score 0 · Answer 3 · edited May 23 '17 at 11:58

Use tag='{http://www.sitemaps.org/schemas/sitemap/0.9}url'

A similar question with a correct answer: https://stackoverflow.com/a/7019273/1346222

#!/usr/bin/python
# coding: utf-8
""" Parsing xml file. Basic example """
from io import BytesIO
from urllib.request import urlopen

from lxml import etree

# Download the sitemap as raw bytes.
sitemap = urlopen(
    'http://google.com/sitemap.xml',
    timeout=10
).read()


# Prefix-to-namespace map used by the XPath queries below.
NS = {
    'x': 'http://www.sitemaps.org/schemas/sitemap/0.9',
    'x2': 'http://www.google.com/schemas/sitemap-mobile/1.0'
}


res = []

# Only <url> elements in the sitemap namespace are reported.
urls = etree.iterparse(BytesIO(sitemap), tag='{http://www.sitemaps.org/schemas/sitemap/0.9}url')

for event, url in urls:
    t = url.xpath('.//x:loc/text() | .//x:priority/text()', namespaces=NS)
    t.append(url.xpath('boolean(.//x2:mobile)', namespaces=NS))
    res.append(t)
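The same tag filter can be tried without a network request. The sketch below assumes a small hand-written sitemap (SAMPLE_SITEMAP, with hypothetical example.com URLs) and shows why the tag has to be given in {namespace}name form: the elements live in the sitemap's default namespace, so a bare 'url' would not match.

```python
from io import BytesIO

from lxml import etree

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

# Hand-written stand-in for a downloaded sitemap (hypothetical URLs).
SAMPLE_SITEMAP = b"""<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url>
  <loc>http://example.com/</loc>
  <priority>1.0</priority>
 </url>
 <url>
  <loc>http://example.com/about</loc>
  <priority>0.5</priority>
 </url>
</urlset>
"""

# Namespace-qualified tag name: {namespace-uri}localname
url_tag = "{%s}url" % SITEMAP_NS

pages = []
# With tag=..., only matching elements are yielded ("end" events by default).
for event, url in etree.iterparse(BytesIO(SAMPLE_SITEMAP), tag=url_tag):
    loc = url.findtext("{%s}loc" % SITEMAP_NS)
    priority = url.findtext("{%s}priority" % SITEMAP_NS)
    pages.append((loc, priority))
    # Release the element once its data has been copied out.
    url.clear()

print(pages)
```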
