I'm trying to parse a large number of XML files, each containing many nested elements, to collect specific information for later use. Because there are so many files, I want to do this as efficiently as possible to reduce processing time. I can extract the needed information using XPath as shown below, but it seems very inefficient, especially having to run a second for loop that extracts the result value with another XPath search. I read the post "Efficient way to iterate through xml elements" and the article "High-performance XML parsing in Python with lxml", but I do not understand how to apply them to my use case. Is there a more efficient method I can use to get the desired output below? Can I collect the information I need with a single XPath query?
Desired Parsed Format:
Id              Object    Type              Result
Packages        total     totalPackages     1200
DeliveryMethod  priority  packagesSent      100
DeliveryMethod  express   packagesSent      200
DeliveryMethod  ground    packagesSent      300
DeliveryMethod  priority  packagesReceived  100
DeliveryMethod  express   packagesReceived  200
DeliveryMethod  ground    packagesReceived  300
XML Sample:
<?xml version="1.0" encoding="utf-8"?>
<Data>
  <Location localDn="Chicago"/>
  <Info Id="Packages">
    <job jobId="1"/>
    <Type pos="1">totalPackages</Type>
    <Value Object="total">
      <result pos="1">1200</result>
    </Value>
  </Info>
  <Info Id="DeliveryMethod">
    <job jobId="1"/>
    <Type pos="1">packagesSent</Type>
    <Type pos="2">packagesReceived</Type>
    <Value Object="priority">
      <result pos="1">100</result>
      <result pos="2">100</result>
    </Value>
    <Value Object="express">
      <result pos="1">200</result>
      <result pos="2">200</result>
    </Value>
    <Value Object="ground">
      <result pos="1">300</result>
      <result pos="2">300</result>
    </Value>
  </Info>
</Data>
My Method:
from lxml import etree

tree = etree.parse('stack_sample.xml')

# Visit every element in the document and act only on <Type> elements
for elem in tree.xpath('//*'):
    if elem.tag == 'Type':
        # Second document-wide XPath search: results with the same pos under the same Info
        for value in tree.xpath(f'//*/Info[@Id="{elem.getparent().attrib["Id"]}"]/Value/result[@pos="{elem.attrib["pos"]}"]'):
            print(elem.getparent().attrib['Id'], value.getparent().attrib['Object'], elem.text, value.text)
Current Output:
Packages total totalPackages 1200
DeliveryMethod priority packagesSent 100
DeliveryMethod express packagesSent 200
DeliveryMethod ground packagesSent 300
DeliveryMethod priority packagesReceived 100
DeliveryMethod express packagesReceived 200
DeliveryMethod ground packagesReceived 300
Is it possible to get all the information just by iterating through tree.xpath('//*')?
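For reference, here is a minimal sketch of the kind of single-pass approach I am asking about. It is not a tested or benchmarked solution. It assumes every file follows the sample's layout (Info elements directly under the root, each holding Type and Value/result children matched by pos), and the function name extract_rows is hypothetical. Instead of a second document-wide XPath search per Type, it walks each Info element once and pairs Type and result through a per-Value dictionary keyed by pos.

```python
try:
    from lxml import etree  # library used in the question
except ImportError:
    # Assumption: parse(), getroot(), iterfind(), get() and .text are also
    # available in the standard library, so the sketch runs without lxml too.
    import xml.etree.ElementTree as etree


def extract_rows(source):
    """Return (Id, Object, Type, Result) tuples for one file path or file object."""
    root = etree.parse(source).getroot()
    rows = []
    for info in root.iterfind('Info'):  # only direct children of the root
        info_id = info.get('Id')
        # One pass over the Value children: (Object, {pos: result text})
        values = [
            (value.get('Object'),
             {result.get('pos'): result.text for result in value.iterfind('result')})
            for value in info.iterfind('Value')
        ]
        # Pair each Type with the result sharing its pos, in document order
        for type_elem in info.iterfind('Type'):
            pos = type_elem.get('pos')
            for obj, results in values:
                if pos in results:
                    rows.append((info_id, obj, type_elem.text, results[pos]))
    return rows
```

Calling extract_rows('stack_sample.xml') and printing each tuple with print(*row) should give the same seven lines as my current output, in the same order. Whether this is actually faster on large files would still need to be measured.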