
I'm trying to parse a large number of XML files that contain many nested elements, collecting specific information to use later on. Because there are so many files, I want to do this as efficiently as possible to reduce processing time. I can extract the needed information using XPath as shown below, but it seems very inefficient, especially the second for loop that runs another XPath search to extract each result value. I read the post "Efficient way to iterate through xml elements" and the article "High-performance XML parsing in Python with lxml", but I do not understand how to apply them to my use case. Is there a more efficient method I can use to get the desired output below? Can I collect the information I need with a single XPath query?

Desired Parsed Format:

Id             Object    Type             Result
Packages       total     totalPackages    1200
DeliveryMethod priority  packagesSent     100
DeliveryMethod express   packagesSent     200
DeliveryMethod ground    packagesSent     300
DeliveryMethod priority  packagesReceived 100
DeliveryMethod express   packagesReceived 200
DeliveryMethod ground    packagesReceived 300

XML Sample:

<?xml version="1.0" encoding="utf-8"?>
    <Data>
        <Location localDn="Chicago"/>
        <Info Id="Packages">
            <job jobId="1"/>
            <Type pos="1">totalPackages</Type>
            <Value Object="total">
                <result pos="1">1200</result>
            </Value>
        </Info>
        <Info Id="DeliveryMethod">
            <job jobId="1"/>
            <Type pos="1">packagesSent</Type>
            <Type pos="2">packagesReceived</Type>
            <Value Object="priority">
                <result pos="1">100</result>
                <result pos="2">100</result>
            </Value>
            <Value Object="express">
                <result pos="1">200</result>
                <result pos="2">200</result>
            </Value>
            <Value Object="ground">
                <result pos="1">300</result>
                <result pos="2">300</result>
            </Value>
        </Info>
  </Data>

My Method:

from lxml import etree

xml_file = open('stack_sample.xml')
tree = etree.parse(xml_file)
root = tree.getroot()

for elem in tree.xpath('//*'):
    if elem.tag == 'Type':
        for value in tree.xpath(f'//*/Info[@Id="{elem.getparent().attrib["Id"]}"]/Value/result[@pos="{elem.attrib["pos"]}"]'):
            print(elem.getparent().attrib['Id'], value.getparent().attrib['Object'], elem.text, value.text)

Current Output:

Packages total totalPackages 1200
DeliveryMethod priority packagesSent 100
DeliveryMethod express packagesSent 200
DeliveryMethod ground packagesSent 300
DeliveryMethod priority packagesReceived 100
DeliveryMethod express packagesReceived 200
DeliveryMethod ground packagesReceived 300

Is it possible to get all the information by iterating just through tree.xpath('//*')?

MBasith

2 Answers


One optimization is to stop walking through every tag with tree.xpath('//*') and filtering with an if statement. This can be replaced with tree.xpath('//Type'), which selects only the Type elements.

The next thing to optimize is the lookup of the values. Instead of searching the whole tree again for every Type (an absolute path such as tree.xpath('//Value')), you can use a relative path and get only the Value elements that are siblings of the current Type with elem.xpath('./following-sibling::Value').

from lxml import etree

xml_file = open('stack_sample.xml')
tree = etree.parse(xml_file)
root = tree.getroot()

for elem in tree.xpath('//Type'):
    _id = elem.getparent().attrib["Id"]
    _type = elem.text
    _position = elem.attrib["pos"]
    values = elem.xpath('./following-sibling::Value')
    for value in values:
        _object = value.attrib['Object']
        _result = value.xpath(f'./result[@pos={_position}]/text()')[0]
        print(_id, _type, _object, _result)

That will print out:

Packages totalPackages total 1200
DeliveryMethod packagesSent priority 100
DeliveryMethod packagesSent express 200
DeliveryMethod packagesSent ground 300
DeliveryMethod packagesReceived priority 100
DeliveryMethod packagesReceived express 200
DeliveryMethod packagesReceived ground 300
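
Since the same expressions run against every file, one further option is to compile them once with etree.XPath and reuse the compiled objects. The following is only a sketch under assumptions: extract_rows and SAMPLE are hypothetical names, the sample is a made-up fragment modelled on the question's XML (with the second-position values changed so the two positions differ), and any speed-up would need to be measured on the real files.

```python
from lxml import etree

# Made-up fragment modelled on the question's XML (values are illustrative).
SAMPLE = b"""<Data>
  <Info Id="DeliveryMethod">
    <Type pos="1">packagesSent</Type>
    <Type pos="2">packagesReceived</Type>
    <Value Object="priority">
      <result pos="1">100</result>
      <result pos="2">150</result>
    </Value>
    <Value Object="express">
      <result pos="1">200</result>
      <result pos="2">250</result>
    </Value>
  </Info>
</Data>"""

# Compile each expression once; the compiled objects are reusable across files.
find_types = etree.XPath('//Type')
find_values = etree.XPath('./following-sibling::Value')
# $pos is an XPath variable that is supplied as a keyword argument on each call.
find_result = etree.XPath('./result[@pos=$pos]/text()')


def extract_rows(root):
    """Return (Id, Type, Object, Result) tuples for one parsed document."""
    rows = []
    for elem in find_types(root):
        _id = elem.getparent().get('Id')
        for value in find_values(elem):
            results = find_result(value, pos=elem.get('pos'))
            if results:  # skip a Value that has no result at this position
                rows.append((_id, elem.text, value.get('Object'), str(results[0])))
    return rows


rows = extract_rows(etree.fromstring(SAMPLE))
for row in rows:
    print(*row)
```

Returning the rows as a list instead of printing them keeps the data available for later use, which is what the question asks for.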

EDIT

This is a solution for the specific case where we are sure that the number of result elements in each Value tag equals the number of Type tags that are siblings of that Value. The solution additionally assumes that Type and result are ordered by the same pos attribute.

Bear in mind that this is a very specific solution, not a generic one.

from lxml import etree

xml_file = open('stack_sample.xml')
tree = etree.parse(xml_file)
root = tree.getroot()

for elem in tree.xpath('//Type'):
    _id = elem.getparent().attrib["Id"]
    _type = elem.text
    _position = elem.attrib["pos"]
    _objects = elem.xpath('./following-sibling::Value/@Object')
    # Select only the result that matches this Type's pos, so every Value
    # contributes exactly one result and zip() pairs them up correctly.
    _results = elem.xpath(f'./following-sibling::Value/result[@pos="{_position}"]/text()')
    for _object, _result in zip(_objects, _results):
        print(_id, _type, _object, _result)

Output:

Packages totalPackages total 1200
DeliveryMethod packagesSent priority 100
DeliveryMethod packagesSent express 200
DeliveryMethod packagesSent ground 300
DeliveryMethod packagesReceived priority 100
DeliveryMethod packagesReceived express 200
DeliveryMethod packagesReceived ground 300
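
Pairing by position is only safe when every Value carries one result per Type, in the same pos order. A minimal sketch of a check that could be run on a file before trusting that pairing is shown here; check_layout and the two inline documents are hypothetical and exist only for illustration.

```python
from lxml import etree


def check_layout(root):
    """Return one tuple per Value whose result positions differ from the Type positions."""
    problems = []
    for info in root.iter('Info'):
        type_pos = [t.get('pos') for t in info.findall('Type')]
        for value in info.findall('Value'):
            result_pos = [r.get('pos') for r in value.findall('result')]
            if result_pos != type_pos:
                problems.append((info.get('Id'), value.get('Object'), type_pos, result_pos))
    return problems


# Illustrative documents: the first matches the assumed layout, the second lacks pos="2".
good = etree.fromstring(
    b'<Data><Info Id="A"><Type pos="1">x</Type><Type pos="2">y</Type>'
    b'<Value Object="o"><result pos="1">1</result><result pos="2">2</result></Value>'
    b'</Info></Data>'
)
bad = etree.fromstring(
    b'<Data><Info Id="A"><Type pos="1">x</Type><Type pos="2">y</Type>'
    b'<Value Object="o"><result pos="1">1</result></Value>'
    b'</Info></Data>'
)

print(check_layout(good))
print(check_layout(bad))
```

An empty list means the file matches the assumed layout; anything else points at the Info and Value that would be paired incorrectly.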
puchal
  • Thanks for your response. This also improved efficiency but was not as fast as Andrej's method. Andrej's method processed 2M counters in 21 s, and your method did it in 2 min 51 s. Your method was still a huge improvement over my original method. Thank you for your help. – MBasith Oct 27 '20 at 14:42
  • Hey @MBasith, thanks for your comment. That's true, Andrej's method is much more efficient for this purpose. It is worth remembering that you can always search with relative paths. By the way, I can come up with another, even faster solution for the case where there is always the same number of `results` and `types` and they are always in the same order. – puchal Oct 27 '20 at 15:17
  • Sure @MBasith, I edited the answer and added one more solution. Please bear in mind all the requirements that must hold for it to apply. – puchal Oct 27 '20 at 15:37
  • Thanks, I tried the newer version and it did improve the efficiency. The same run now takes 1 min 6 s. Good tip on searching with relative paths. – MBasith Oct 27 '20 at 17:34

Maybe it will be more performant if you don't iterate over all tags (//*), but just <Value>s:

from lxml import etree

xml_file = open('stack_sample.xml')
tree = etree.parse(xml_file)
root = tree.getroot()

for val in tree.xpath('//Value'):
    t = {t.get('pos'): t.text for t in val.getparent().xpath('./Type')}
    for r in val.xpath('./result'):
        print(val.getparent().get('Id'), val.get('Object'), t[r.get('pos')], r.text)

Prints:

Packages total totalPackages 1200
DeliveryMethod priority packagesSent 100
DeliveryMethod priority packagesReceived 100
DeliveryMethod express packagesSent 200
DeliveryMethod express packagesReceived 200
DeliveryMethod ground packagesSent 300
DeliveryMethod ground packagesReceived 300
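
A possible variation on the same idea, sketched under the assumption that Type and Value are always direct children of Info: loop over the Info elements so the pos-to-Type mapping is built once per Info rather than once per Value, and gather the rows in a list for later use. collect_rows and SAMPLE are illustrative names, the sample is a trimmed copy of the question's XML, and whether findall beats xpath here would have to be measured on the real files.

```python
from lxml import etree

# Trimmed copy of the question's sample document.
SAMPLE = b"""<Data>
  <Info Id="Packages">
    <Type pos="1">totalPackages</Type>
    <Value Object="total"><result pos="1">1200</result></Value>
  </Info>
  <Info Id="DeliveryMethod">
    <Type pos="1">packagesSent</Type>
    <Type pos="2">packagesReceived</Type>
    <Value Object="priority">
      <result pos="1">100</result>
      <result pos="2">100</result>
    </Value>
    <Value Object="express">
      <result pos="1">200</result>
      <result pos="2">200</result>
    </Value>
  </Info>
</Data>"""


def collect_rows(root):
    """Return (Id, Object, Type, Result) tuples, grouped by Value."""
    rows = []
    for info in root.iter('Info'):
        info_id = info.get('Id')
        # Build the pos -> Type name lookup once per Info element.
        types = {t.get('pos'): t.text for t in info.findall('Type')}
        for value in info.findall('Value'):
            obj = value.get('Object')
            for result in value.findall('result'):
                rows.append((info_id, obj, types[result.get('pos')], result.text))
    return rows


rows = collect_rows(etree.fromstring(SAMPLE))
for row in rows:
    print(*row)
```

The rows come out grouped by Value, as in the output above; sorting the list afterwards would give the Type-major order from the question if that matters.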
Andrej Kesely
  • Thanks Andrej. This provided an incredible improvement in processing time. Will avoid iterating over //*. – MBasith Oct 27 '20 at 14:12