<?xml version="1.0" encoding="UTF-8"?>
<security_master v="47">
    <header>
        <type>initialize</type>
        <timestamp>2023-02-08T02:15:12Z</timestamp>
    </header>
    <payload>
        <instrument id="469408967">
            <master_information>
                <instrument_master>
                </instrument_master>
                <market_master>
                </market_master>
                <organization_master id="321942681">
                </organization_master>
            </master_information>
            <global_information>
            </global_information>
            <debt>
                <fixed_income>
                </fixed_income>
                <bank_loan_details>
                    <amounts>
                    </amounts>
                    <facility>
                    </facility>
                    <loan_contract id="476860356">
                    </loan_contract>
                </bank_loan_details>
            </debt>
        </instrument>
    </payload>
</security_master>

I have a large (~100 MB) XML file containing a bunch of data points related to a set of 'instruments.' I'm concerned with the data point 'loan_contract.' Sample structure is above - there are ~5000 instruments in the file.

My goal is, for each instrument, to get all loan_contract elements. This will eventually go into csv, so I need a distinct list of loan_contracts and their associated instruments, ignoring all instruments without loan_contracts.

A few things to note:

  1. Not all instruments will have an associated loan_contract element (the loan_contract tag is simply absent in that case), some will have exactly one, and others will have more than one (the maximum number is not known)

  2. loan_contract is the great-grandchild of instrument

  3. loan_contract and instrument are in the form of

    <loan_contract id="___">

    <instrument id="___">
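
For the sample XML above, the desired CSV output would be one row per (instrument, loan_contract) pair, e.g.:

```
INSTRUMENT,LOAN_CONTRACT_ID
469408967,476860356
```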

The code below seems to work, but it is extremely slow and most likely not the most efficient solution. Any feedback/suggestions would be much appreciated.

import xml.etree.ElementTree as ET 
import pandas as pd 

#parse XML file 

tree = ET.parse('/Users/psteward/Documents/py/py/initial.xml')

root = tree.getroot()

#initialize the list for each field 

INSTRUMENT = []
LOAN_CONTRACT_ID = []

for item in root.findall('payload/instrument/debt/bank_loan_details/loan_contract'):
    LOAN_CONTRACT_ID.append(item.get('id'))

for contract in LOAN_CONTRACT_ID:
    INSTRUMENT.append(root.find('payload/instrument/debt/bank_loan_details/loan_contract[@id="%s"]/../../..' % contract).get('id'))

# set up dataframe with the lists 

instruments_df = pd.DataFrame( 
    list(
        zip(
            INSTRUMENT,
            LOAN_CONTRACT_ID
        )
    ),  
    columns =
    ['INSTRUMENT',
    'LOAN_CONTRACT_ID']
)

instruments_df.to_csv("data.csv")
pdsmth
  • Please post sample XML for [mre]. Please also clarify *extremely slow*: 5 secs? 5 mins? 5 hours? Why use a large library like pandas to convert XML to CSV? – Parfait Feb 22 '23 at 03:14
  • Consider using up-to-date string formatting in Python. The modulo, `%`, operator has been [de-emphasized but not officially deprecated *yet*](https://stackoverflow.com/a/13452357/1422451). – Parfait Feb 22 '23 at 03:16
  • Sample XML posted. As for the time, it takes ~30 mins. I didn't actually time it (I will next time), but it's on that order of magnitude. – pdsmth Feb 22 '23 at 04:05
  • I used pandas because this was the first python program I've written and I pieced it together via medium articles and stack exchange - the suggestion came from one of those and worked, so I moved on...open to other suggestions though. Only real need is outputting a bunch of lists to csv columns. Thanks for the heads-up, I'll update that piece. – pdsmth Feb 22 '23 at 04:12
  • 30 minutes is certainly chronic. I would start by (a) establishing whether performance is linear or quadratic with data size - that will help you narrow the investigation; (b) establishing whether the problem is in parsing/tree building, in tree navigation, or in output generation; which you can do either by profiling, or by experimentally eliminating one or more of the steps. (Alternatively, I sometimes find it useful to *repeat* one of the steps to see how much the time increases). – Michael Kay Feb 22 '23 at 09:54
  • Thanks Michael - those will be good general debugging tips to have on hand. I'll do some testing to see what I can find. – pdsmth Feb 22 '23 at 16:59

3 Answers


You can measure the runtime. I did this and found that my example runs roughly twice as fast as the suggestion from @Parfait. I couldn't measure @Michael Kay's approach; maybe it's faster than mine?

import time
import xml.etree.ElementTree as ET
import pandas as pd
start = time.time()

tree = ET.parse("instrument.xml")
root = tree.getroot()

columns = ["Instrument", "Loan_Contract"]
data = []
for elem in root.iter():
    if elem.tag == "instrument":
        instr_id = elem.get('id')
    elif elem.tag == "loan_contract":
        loan_id = elem.get('id')
        if loan_id is not None:
            data.append((instr_id, loan_id))
        
df = pd.DataFrame(data, columns=columns)
print(df)

end = time.time()
print("Execution time:", (end-start) * 10**3, "ms")

Output:

  Instrument Loan_Contract
0  469408967     476860356
Execution time: 8.099794387817383 ms

With this small XML example the bottleneck is the CSV writing! If I add the CSV write, the total time is similar to yours or @Parfait's solution. I wonder what the effect is with a larger XML file; maybe you can give it a try and tell us the result.

Hermann12

Consider parsing the nodes relative to each other with nested loops and use the csv module to create the CSV, particularly csv.DictWriter to build rows from dictionaries:

import csv
import xml.etree.ElementTree as ET

doc = ET.parse("Input.xml")

with open("InstrumentsLoanId.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames = ['INSTRUMENT', 'LOAN_CONTRACT_ID'])
    w.writeheader()

    for instrument in doc.findall(".//instrument"):
        for loan in instrument.findall(".//loan_contract"):
            d = {
                "INSTRUMENT": instrument.attrib['id'],
                "LOAN_CONTRACT_ID": loan.attrib['id']
            }

            w.writerow(d)

Possibly, you can get a faster parse by swapping out findall for iterfind: the former builds a full list of matching Elements up front, while the latter yields Elements lazily from a generator without holding the whole result set in memory. See What's difference between findall() and iterfind() of xml.etree.ElementTree?

    for instrument in doc.iterfind(".//instrument"):
        for loan in instrument.iterfind(".//loan_contract"):
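
For a ~100 MB input it may also be worth avoiding building the full tree at all. Below is a minimal sketch using `ET.iterparse` (not part of the answer above; it assumes the element names from the posted sample, and `extract_pairs` is an illustrative helper name):

```python
import xml.etree.ElementTree as ET

def extract_pairs(source):
    """Stream (instrument_id, loan_contract_id) pairs without loading the whole tree."""
    pairs = []
    instr_id = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "instrument":
            # remember the enclosing instrument while we descend into it
            instr_id = elem.get("id")
        elif event == "end" and elem.tag == "loan_contract":
            pairs.append((instr_id, elem.get("id")))
        elif event == "end" and elem.tag == "instrument":
            # drop the finished subtree so memory stays bounded
            elem.clear()
    return pairs
```

Instruments without a loan_contract simply contribute no rows, which matches the "ignore instruments without loan_contracts" requirement.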
Parfait
  • Thanks for the CSV help, Parfait - I'll give that a try. Have been testing findall vs. iterfind and haven't noticed a huge difference, but could be more significant when using the above code. – pdsmth Feb 22 '23 at 17:07
  • Somewhat pedantic question, but is there a reason you prefer "element.attrib['id']" to "element.get('id')" ? – pdsmth Feb 22 '23 at 17:32
  • Deleted my old comment because this worked. I even left the pandas/dataframe portion as is, and it took 5 seconds max. I just needed to keep the line where I append the instrument id to it's list under the 2nd for loop and that maintains the association all the way through. Much appreciated. – pdsmth Feb 22 '23 at 18:25
  • Glad to hear and help! `get` would be better since it will not error out if `id` attribute does not exist in that element. Happy coding! – Parfait Feb 23 '23 at 19:23

I would normally use XSLT for this kind of job, and I would expect it to take under 10 seconds.

I suspect that the problem is in your nested loop:

for item in root.findall('payload/instrument/debt/bank_loan_details/loan_contract'):
    LOAN_CONTRACT_ID.append(item.get('id'))

for contract in LOAN_CONTRACT_ID:
    INSTRUMENT.append(root.find('payload/instrument/debt/bank_loan_details/loan_contract[@id="%s"]/../../..' % contract).get('id'))

The first "for" is effectively searching the whole document to produce a list of IDs, and for each of these IDs, you are searching the document again to find it. You need to build an index.

Michael Kay
  • Agreed that XSLT could be more efficient. This is my first foray into python though, and wanted to solve this real-world problem as my 'introductory' course. Also agreed the nested loop is causing issues. I made a tweak and am testing runtime. – pdsmth Feb 22 '23 at 17:03