How to parse a big XML file and process its elements as ObjectifiedElement (using the objectify parser)?

I haven't found a better solution than:

from lxml import etree, objectify
for event, elt in etree.iterparse('onebigfile.xml', tag='MyTag'):
    oelt = objectify.fromstring(etree.tostring(elt))
    my_process(oelt)

How can I avoid this intermediate string representation?

ElBidoule
  • What do you want to do with the elements of `oelt`? – Bill Bell Apr 17 '18 at 14:55
  • @bill-bell For every `MyTag` element, I want to extract and transform its data then update a row in a database. – ElBidoule Apr 17 '18 at 15:14
  • Without knowing details, I wonder if this would be a good application for xslt? – Bill Bell Apr 17 '18 at 15:20
  • There seems to be no way to `objectify` an XML document iteratively in lxml, but you are right that the intermediate tostring/fromstring step is wasteful. When you are already using `iterparse`, could you not skip the `objectify` part and access the nodes under `elt` directly? `objectify`ing XML takes time as well, after all, whereas accessing the XML data directly during parsing is free. – Tomalak Apr 17 '18 at 16:07
  • I don't want to objectify the whole document, I want to objectify the 400,000 "MyTag" nodes inside it. I want to use objectify because I want to benefit from the data binding it offers and thus write nice Python code (accessing attributes) instead of using methods from ElementTree. – ElBidoule Apr 17 '18 at 19:50
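
For readers who mainly want the attribute-style access described in the comment above, here is a minimal sketch of a thin wrapper around an already-parsed element. It is not objectify and does no type conversion. The `Node` class and the sample XML are hypothetical, and the stdlib `xml.etree.ElementTree` is used as a stand-in for `lxml.etree` so the snippet runs on its own.

```python
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree


class Node:
    """Hypothetical read-only wrapper exposing child elements as attributes."""

    def __init__(self, elt):
        self._elt = elt

    def __getattr__(self, name):
        # Only called for names not found normally, so treat them as child tags.
        child = self._elt.find(name)
        if child is None:
            raise AttributeError(name)
        return Node(child)

    def get(self, key, default=None):
        # XML attributes stay behind an explicit method.
        return self._elt.get(key, default)

    @property
    def text(self):
        return self._elt.text

    def __str__(self):
        return self._elt.text or ''


# Made-up sample element, standing in for one <MyTag> from iterparse.
elt = ET.fromstring('<MyTag id="7"><owner><name>Ann</name></owner></MyTag>')
node = Node(elt)
owner_name = str(node.owner.name)
```

The wrapper only looks up the first child with a given tag, and child tags named `text` or `get` would be shadowed by the wrapper's own members, so it is a convenience for simple structures rather than a replacement for objectify.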

1 Answer

I think it's really easy to use iterparse to build a custom data extractor that completely removes the need for using objectify.

For the sake of this example, I've used a .NET reference XML file that looks a bit like this:

<doc>
  <assembly>
    <name>System.IO</name>
  </assembly>
  <members>
    <member name="T:System.IO.BinaryReader">
      <summary>Reads primitive data types as binary values in a specific encoding.</summary>
      <filterpriority>2</filterpriority>
    </member>
    <member name="M:System.IO.BinaryReader.#ctor(System.IO.Stream)">
      <summary>Initializes a new instance of the <see cref="T:System.IO.BinaryReader" /> class based on the specified stream and using UTF-8 encoding.</summary>
      <param name="input">The input stream. </param>
      <exception cref="T:System.ArgumentException">The stream does not support reading, is null, or is already closed. </exception>
    </member>
    <member name="M:System.IO.BinaryReader.#ctor(System.IO.Stream,System.Text.Encoding)">
      <summary>Initializes a new instance of the <see cref="T:System.IO.BinaryReader" /> class based on the specified stream and character encoding.</summary>
      <param name="input">The input stream. </param>
      <param name="encoding">The character encoding to use. </param>
      <exception cref="T:System.ArgumentException">The stream does not support reading, is null, or is already closed. </exception>
      <exception cref="T:System.ArgumentNullException">
        <paramref name="encoding" /> is null. </exception>
    </member>
    <!-- ... many more members like this -->
  </members>
</doc>

Assuming you want to extract every member with its name, summary and parameters as a dict like this:

{
  'summary': 'Reads primitive data types as binary values in a specific encoding.',
  'name': 'T:System.IO.BinaryReader'
}
{
  'summary': 'Initializes a new instance of the class based on the specified stream and using UTF-8 encoding.',
  '@input': 'The input stream.',
  'name': 'M:System.IO.BinaryReader.#ctor(System.IO.Stream)'
}
{
  'summary': 'Initializes a new instance of the class based on the specified stream and character encoding.',
  '@input': 'The input stream.',
  '@encoding': 'The character encoding to use.',
  'name': 'M:System.IO.BinaryReader.#ctor(System.IO.Stream,System.Text.Encoding)'
}

you could do it like this:

  • use lxml.etree.iterparse with start and end events
  • when a <member> element starts, prepare a new dict (item)
  • when we're inside a <member> element, add anything we're interested in to the dict
  • when the <member> element ends, finalize the dict and yield it
  • setting item to None functions as the "inside/outside of <member>"-flag

In code:

from lxml import etree

def text_content(elt):
    return ' '.join([t.strip() for t in elt.itertext()])

def extract_data(xmlfile):
    item = None

    for event, elt in etree.iterparse(xmlfile, events=['start', 'end']):
        if elt.tag == 'member':
            if event == 'start':
                item = {}
            else:
                item['name'] = elt.attrib['name']
                yield item
                item = None
                elt.clear()  # free the finished <member> subtree

        if item is None:
            continue

        if event == 'end':
            if elt.tag in ('summary', 'returns'):
                item[elt.tag] = text_content(elt)
                continue

            if elt.tag == 'param':
                item['@' + elt.attrib['name']] = text_content(elt)
                continue


testfile = r'C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETCore\v4.5.1\System.IO.xml'

for item in extract_data(testfile):
    print(item)

This way you get fast, memory-efficient parsing (provided you clear elements once you are done with them) and fine control over what data you look at. Using objectify would be more wasteful than that, even without the intermediate tostring()/fromstring().
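
To try the approach without the .NET reference file, here is a minimal self-contained sketch of the same start/end state machine. It assumes a made-up in-memory sample (`SAMPLE`) and uses the stdlib `xml.etree.ElementTree` as a stand-in for `lxml.etree`; both accept the `events` argument and both provide `clear()` on elements. The whitespace handling in `text_content` is slightly different from the version above.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

# Made-up sample data, shaped like the reference file above.
SAMPLE = b"""<doc>
  <members>
    <member name="T:Demo.Reader">
      <summary>Reads values.</summary>
    </member>
    <member name="M:Demo.Reader.#ctor(Stream)">
      <summary>Creates a <see cref="T:Demo.Reader" /> from a stream.</summary>
      <param name="input">The input stream. </param>
    </member>
  </members>
</doc>"""


def text_content(elt):
    # Concatenate all text fragments and collapse runs of whitespace.
    return ' '.join(''.join(elt.itertext()).split())


def extract_data(source):
    item = None  # None means "currently outside of a <member>"
    for event, elt in ET.iterparse(source, events=('start', 'end')):
        if elt.tag == 'member':
            if event == 'start':
                item = {}
            else:
                item['name'] = elt.attrib['name']
                yield item
                item = None
                elt.clear()  # drop the finished subtree to keep memory flat
            continue
        if item is None or event != 'end':
            continue
        if elt.tag in ('summary', 'returns'):
            item[elt.tag] = text_content(elt)
        elif elt.tag == 'param':
            item['@' + elt.attrib['name']] = text_content(elt)


items = list(extract_data(io.BytesIO(SAMPLE)))
for item in items:
    print(item)
```

Child elements are only read on their `end` event, because their text is not guaranteed to be complete when the `start` event fires.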

Tomalak
  • Yes! The most powerful feature of lxml is the ability to do processing on events and coupled with iterparse means the large file can be converted to a generator. This means memory overhead becomes proportional to tree depth. – cowbert Apr 17 '18 at 17:43
  • Sorry but I already know how to do that: in my example just delete `oelt = ` and replace `my_process` by `my_process_using_ugly_element_tree`. I want to use objectify for the benefit of it: data-binding and thus writing `my_process` with attributes instead of ElementTree methods. – ElBidoule Apr 17 '18 at 19:51
  • There are no elementtree methods in my code (beyond iterparse), so I don't quite know what you refer to? Anyway, then you must use tostring/fromstring. I have not found any indication in the objectify source code that other approaches work. – Tomalak Apr 17 '18 at 20:59
  • No ElementTree indeed, my bad. But your example is relatively simple compared to what I have to deal with. Manually building a dict like you do is exactly what objectify is for. Thanks for the answer about the source code, I was hoping I had missed something... – ElBidoule Apr 18 '18 at 06:01
  • Yeah, but `objectify` still needs time to objectify *all the XML* even if you only care about a fraction of it. You could also keep your approach with `tag='MyTag'` and use XPath to extract the interesting parts of the data. Or you do `tostring`/`fromstring` and don't care about the inefficiency. Maybe it's fast enough. – Tomalak Apr 18 '18 at 06:46
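
The last suggestion (keep the per-tag iteration and pull out only the interesting parts with path lookups) could be sketched roughly as follows. The sample document, the field names and the `iter_rows` helper are all hypothetical. The stdlib `xml.etree.ElementTree` is used so the snippet is self-contained; with lxml, the `tag='MyTag'` argument from the question would replace the manual tag check.

```python
import io
import xml.etree.ElementTree as ET  # stdlib stand-in for lxml.etree

# Made-up sample; the real MyTag structure is not shown in the question.
SAMPLE = b"""<root>
  <Other><id>0</id></Other>
  <MyTag id="1"><name>alpha</name><value unit="kg">3.5</value></MyTag>
  <MyTag id="2"><name>beta</name><value unit="g">120</value></MyTag>
</root>"""


def iter_rows(source, tag='MyTag'):
    # Only 'end' events: the element is complete when we look at it.
    for _event, elt in ET.iterparse(source, events=('end',)):
        if elt.tag != tag:
            continue
        value = elt.find('value')
        yield {
            'id': int(elt.get('id')),
            'name': elt.findtext('name'),
            'value': float(value.text),
            'unit': value.get('unit'),
        }
        elt.clear()  # release the processed element


rows = list(iter_rows(io.BytesIO(SAMPLE)))
```

Each yielded dict could then feed the database update directly, with no serialization round trip.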