<?xml version="1.0" encoding="UTF-8"?>
<security_master v="47">
<header>
<type>initialize</type>
<timestamp>2023-02-08T02:15:12Z</timestamp>
</header>
<payload>
<instrument id="469408967">
<master_information>
<instrument_master>
</instrument_master>
<market_master>
</market_master>
<organization_master id="321942681">
</organization_master>
</master_information>
<global_information>
</global_information>
<debt>
<fixed_income>
</fixed_income>
<bank_loan_details>
<amounts>
</amounts>
<facility>
</facility>
<loan_contract id="476860356">
</loan_contract>
</bank_loan_details>
</debt>
</instrument>
</payload>
</security_master>
I have a large (~100 MB) XML file containing a bunch of data points related to a set of 'instruments.' I'm concerned with the data point 'loan_contract.' Sample structure is above - there are ~5000 instruments in the file.
My goal is, for each instrument, to get all loan_contract elements. This will eventually go into csv, so I need a distinct list of loan_contracts and their associated instruments, ignoring all instruments without loan_contracts.
A few things to note:
Not all instruments will have an associated loan_contract element (and it's asymmetric so the loan_contract tag would be 'missing' in this case), some will have 1, and others will have more than one (the maximum # is not know)
loan_contract is the great grandchild of instrument
loan_contract and instrument are in the form of
<loan_contract id="___">
<instrument id="___">
The code below seems to work, but it is extremely slow and most likely not the most efficient solution. Any feedback/suggestions would be much appreciated.
import xml.etree.ElementTree as ET
import pandas as pd
#parse XML file
tree = ET.parse('/Users/psteward/Documents/py/py/initial.xml')
root = tree.getroot()
#initialize the list for each field
INSTRUMENT = []
LOAN_CONTRACT_ID = []
for item in root.findall('payload/instrument/debt/bank_loan_details/loan_contract'):
LOAN_CONTRACT_ID.append(item.get('id'))
for contract in LOAN_CONTRACT_ID:
INSTRUMENT.append(root.find('payload/instrument/debt/bank_loan_details/loan_contract[@id="%s"]/../../..' % contract).get('id'))
# set up dataframe with the lists
instruments_df = pd.DataFrame(
list(
zip(
INSTRUMENT,
LOAN_CONTRACT_ID
)
),
columns =
['INSTRUMENT',
'LOAN_CONTRACT_ID']
)
instruments_df.to_csv("data.csv")