<?xml version="1.0" encoding="UTF-8"?>
<security_master v="47">
    <header>
        <type>initialize</type>
        <timestamp>2023-02-08T02:15:12Z</timestamp>
    </header>
    <payload>
        <instrument id="469408967">
            <master_information>
                <instrument_master>
                </instrument_master>
                <market_master>
                </market_master>
                <organization_master id="321942681">
                </organization_master>
            </master_information>
            <global_information>
            </global_information>
            <debt>
                <fixed_income>
                </fixed_income>
                <bank_loan_details>
                    <amounts>
                    </amounts>
                    <facility>
                    </facility>
                    <loan_contract id="476860356">
                    </loan_contract>
                </bank_loan_details>
            </debt>
        </instrument>
    </payload>
</security_master>

I have a large (~100 MB) XML file containing a bunch of data points related to a set of 'instruments.' I'm concerned with the data point 'loan_contract.' Sample structure is above - there are ~5000 instruments in the file.

My goal is, for each instrument, to get all loan_contract elements. This will eventually go into csv, so I need a distinct list of loan_contracts and their associated instruments, ignoring all instruments without loan_contracts.

A few things to note:

  1. Not all instruments will have an associated loan_contract element (the loan_contract tag is simply absent in that case), some will have exactly one, and others will have more than one (the maximum number is not known)

  2. loan_contract is the great-grandchild of instrument

  3. loan_contract and instrument are in the form of

    <loan_contract id="___">

    <instrument id="___">
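
For the sample XML above, the desired CSV output would be one row per (instrument, loan_contract) pair, e.g.:

```
INSTRUMENT,LOAN_CONTRACT_ID
469408967,476860356
```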

The code below seems to work, but it is extremely slow and most likely not the most efficient solution. Any feedback/suggestions would be much appreciated.

import xml.etree.ElementTree as ET 
import pandas as pd 

#parse XML file 

tree = ET.parse('/Users/psteward/Documents/py/py/initial.xml')

root = tree.getroot()

#initialize the list for each field 

INSTRUMENT = []
LOAN_CONTRACT_ID = []

for item in root.findall('payload/instrument/debt/bank_loan_details/loan_contract'):
    LOAN_CONTRACT_ID.append(item.get('id'))

for contract in LOAN_CONTRACT_ID:
    INSTRUMENT.append(root.find('payload/instrument/debt/bank_loan_details/loan_contract[@id="%s"]/../../..' % contract).get('id'))

# set up dataframe with the lists 

instruments_df = pd.DataFrame( 
    list(
        zip(
            INSTRUMENT,
            LOAN_CONTRACT_ID
        )
    ),  
    columns =
    ['INSTRUMENT',
    'LOAN_CONTRACT_ID']
)

instruments_df.to_csv("data.csv")
pdsmth
  • Please post sample XML for [mre]. Please also clarify *extremely slow*: 5 secs? 5 mins? 5 hours? Why use a large library like pandas to convert XML to CSV? – Parfait Feb 22 '23 at 03:14
  • Consider using up-to-date string formatting in Python. The modulo, `%`, operator has been [de-emphasized but not officially deprecated *yet*](https://stackoverflow.com/a/13452357/1422451). – Parfait Feb 22 '23 at 03:16
  • Sample XML posted. As for the time, it takes ~30 mins. I didn't actually time it (I will next time), but it's on that order of magnitude. – pdsmth Feb 22 '23 at 04:05
  • I used pandas because this was the first python program I've written and I pieced it together via medium articles and stack exchange - the suggestion came from one of those and worked, so I moved on...open to other suggestions though. Only real need is outputting a bunch of lists to csv columns. Thanks for the heads-up, I'll update that piece. – pdsmth Feb 22 '23 at 04:12
  • 30 minutes is certainly chronic. I would start by (a) establishing whether performance is linear or quadratic with data size - that will help you narrow the investigation; (b) establishing whether the problem is in parsing/tree building, in tree navigation, or in output generation; which you can do either by profiling, or by experimentally eliminating one or more of the steps. (Alternatively, I sometimes find it useful to *repeat* one of the steps to see how much the time increases). – Michael Kay Feb 22 '23 at 09:54
  • Thanks Michael - those will be good general debugging tips to have on hand. I'll do some testing to see what I can find. – pdsmth Feb 22 '23 at 16:59

3 Answers


You can measure the runtime. I did this and found that my example runs roughly twice as fast as the suggestion from @Parfait. I couldn't measure @Michael Kay's approach; maybe it's faster than mine?

import time
import xml.etree.ElementTree as ET
import pandas as pd
start = time.time()

tree = ET.parse("instrument.xml")
root = tree.getroot()

columns = ["Instrument", "Loan_Contract"]
data = []
for elem in root.iter():
    if elem.tag == "instrument":
        instr_id = elem.get('id')
    elif elem.tag == "loan_contract":
        loan_id = elem.get('id')
        if loan_id is not None:
            data.append((instr_id, loan_id))
        
df = pd.DataFrame(data, columns=columns)
print(df)

end = time.time()
print("Execution time:", (end-start) * 10**3, "ms")

Output:

  Instrument Loan_Contract
0  469408967     476860356
Execution time: 8.099794387817383 ms

With this small XML example the bottleneck is the CSV writing! If I add the CSV write, the total time is similar to yours or @Parfait's solution. I wonder what the effect is with a larger XML file; maybe you can give it a try and tell us the result.

Hermann12

Consider parsing the nodes relative to each other with nested loops and use the csv module to create the CSV, particularly csv.DictWriter to build rows from dictionaries:

import csv
import xml.etree.ElementTree as ET

doc = ET.parse("Input.xml")

with open("InstrumentsLoanId.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames = ['INSTRUMENT', 'LOAN_CONTRACT_ID'])
    w.writeheader()

    for instrument in doc.findall(".//instrument"):
        for loan in instrument.findall(".//loan_contract"):
            d = {
                "INSTRUMENT": instrument.attrib['id'],
                "LOAN_CONTRACT_ID": loan.attrib['id']
            }

            w.writerow(d)

Possibly, you can get a faster parse by swapping out findall for iterfind: the former builds a full list of matching Elements up front, while the latter yields Elements lazily from a generator without holding the whole result set in memory. See What's difference between findall() and iterfind() of xml.etree.ElementTree?

    for instrument in doc.iterfind(".//instrument"):
        for loan in instrument.iterfind(".//loan_contract"):
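
For a ~100 MB input it may also be worth avoiding building the full tree at all. Below is a minimal sketch using `ET.iterparse` (not part of the answer above; it assumes the element names from the posted sample, and `extract_pairs` is an illustrative helper name):

```python
import xml.etree.ElementTree as ET

def extract_pairs(source):
    """Stream (instrument_id, loan_contract_id) pairs without loading the whole tree."""
    pairs = []
    instr_id = None
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start" and elem.tag == "instrument":
            # remember the enclosing instrument while we descend into it
            instr_id = elem.get("id")
        elif event == "end" and elem.tag == "loan_contract":
            pairs.append((instr_id, elem.get("id")))
        elif event == "end" and elem.tag == "instrument":
            # drop the finished subtree so memory stays bounded
            elem.clear()
    return pairs
```

Instruments without a loan_contract simply contribute no rows, which matches the "ignore instruments without loan_contracts" requirement.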
Parfait
  • Thanks for the CSV help, Parfait - I'll give that a try. Have been testing findall vs. iterfind and haven't noticed a huge difference, but could be more significant when using the above code. – pdsmth Feb 22 '23 at 17:07
  • Somewhat pedantic question, but is there a reason you prefer "element.attrib['id']" to "element.get('id')" ? – pdsmth Feb 22 '23 at 17:32
  • Deleted my old comment because this worked. I even left the pandas/dataframe portion as is, and it took 5 seconds max. I just needed to keep the line where I append the instrument id to it's list under the 2nd for loop and that maintains the association all the way through. Much appreciated. – pdsmth Feb 22 '23 at 18:25
  • Glad to hear and help! `get` would be better since it will not error out if `id` attribute does not exist in that element. Happy coding! – Parfait Feb 23 '23 at 19:23

I would normally use XSLT for this kind of job, and I would expect it to take under 10 seconds.

I suspect that the problem is in your nested loop:

for item in root.findall('payload/instrument/debt/bank_loan_details/loan_contract'):
    LOAN_CONTRACT_ID.append(item.get('id'))

for contract in LOAN_CONTRACT_ID:
    INSTRUMENT.append(root.find('payload/instrument/debt/bank_loan_details/loan_contract[@id="%s"]/../../..' % contract).get('id'))

The first "for" is effectively searching the whole document to produce a list of IDs, and for each of these IDs, you are searching the document again to find it. You need to build an index.

Michael Kay
  • Agreed that XSLT could be more efficient. This is my first foray into python though, and wanted to solve this real-world problem as my 'introductory' course. Also agreed the nested loop is causing issues. I made a tweak and am testing runtime. – pdsmth Feb 22 '23 at 17:03