Python: How to process large XML file with a lot of childs in 1 root

Question

I have XML file with a data structure like

<report>
  <table>
    <detail name="John" surname="Smith">
    <detail name="Michael" surname="Smith">
    <detail name="Nick" surname="Smith">
    ... {a lot of <detail> elements}
  </table>
</report>

I need to check whether elements with attribute 'name'=='surname'.

XML file is >1 GB, and I have an error trying etree.parse(file).

How can I process elements ony-by-one using Python and LXML?

What's the error that you are getting? – hostingutilities.com Sep 04 '17 at 15:24 — hostingutilities.com, Sep 04 '17 at 15:24
What processing do you want to do? – Bill Bell Sep 04 '17 at 16:05 — Bill Bell, Sep 04 '17 at 16:05

score 3 · Accepted Answer · answered Sep 04 '17 at 18:13

Consider iterparse that allows you to work on elements as the tree is being built. Below checks if name attribute is equivalent to surname attribute. Use the if block to process further like conditionally append values to a list:

import xml.etree.ElementTree as et

data = []
path = "/path/to/source.xml"

# get an iterable
context = et.iterparse(path, events=("start", "end"))

# turn it into an iterator
context = iter(context)

# get the root element
ev, root = next(context)

for ev, el in context:
    if ev == 'start' and el.tag == 'detail':
        print(el.attrib['name'] == el.attrib['surname'])
        data.append([el.attrib['name'], el.attrib['surname']])
        root.clear()

print(data)
# False
# False
# False

# [['John', 'Smith'], ['Michael', 'Smith'], ['Nick', 'Smith']]

newtover · Answer 2 · 2017-09-04T17:04:36.213

There are basically three standard approaches to parsing XML:

building an in-memory Document Object Model (DOM) - you load the whole document into memory and can arbitrarily walk along the tree
writing a pushing SAX parser - processing of the document becomes a sequence of events (an opening tag, the text, an ending tag, comment, processing instruction, etc) to several of which you can subscribe. You register your callbacks and run the parsing. The document is read until the end, but the parser doesn't build in internal representation of the whole document.
writing a pulling StAX parser - the parser streams different events, you sequentially process all of them, but can stop at any time (useful for parsing of XML-metadata at the beginning of the document and stop processing)

lxml is a binding to libxml C library, which is an implementation of DOM, the iterparse method seems to be the implementation of the StAX approach. The SAX parser is built into the python itself: https://docs.python.org/3.6/library/xml.sax.html

For your case the standard approach is to use a SAX parser.

score 0 · Answer 3 · answered Sep 04 '17 at 16:35

You could use the iterparse method, which is meant for handling large xml files. However, your file has an especially simple structure. Using iterparse would be unnecessarily complicated.

I will provide two answers in one script. I answer your question directly by showing how to parse lines in the xml using lxml and I provide what I think is likely to be a better answer using a regex.

The code reads each line in the xml and ignores those lines that do not begin with 'try ... except. When the script finds such a line it passes it to etree from lxml for parsing then displays the attributes from the line. Afterwards it uses a regex to parse out the same attributes and to display them.

I strongly suspect that the regex would be faster.

>>> from lxml import etree
>>> report = '''\
... <report>
...     <table>
...         <detail name="John" surname="Smith">
...         <detail name="Michael" surname="Smith">
...         <detail name="Nick" surname="Smith">
...     </table>
... </report>'''
>>> import re
>>> re.search(r'name="([^"]*)"\s+surname="([^"]*)', line).groups()
('John', 'Smith')
>>> for line in report.split('\n'):
...     if line.strip().startswith('<detail'):
...         tree = etree.fromstring(line.replace('>', '/>'))
...         tree.attrib['name'], tree.attrib['surname']
...         re.search(r'name="([^"]*)"\s+surname="([^"]*)', line).groups()
...         
('John', 'Smith')
('John', 'Smith')
('Michael', 'Smith')
('Michael', 'Smith')
('Nick', 'Smith')
('Nick', 'Smith')

Are we using regex on X/HTML? Not a [recommended approach](https://stackoverflow.com/a/1732454/1422451). — Parfait, Sep 04 '17 at 17:42

Python: How to process large XML file with a lot of childs in 1 root

3 Answers3