I am trying to parse a XML using the python library lxml, and would like the resulting output to be in a dataframe. I am relatively new to python and parsing so please bear with me as I outline the problem. The original xml that I am trying to parse is available here
I am interested in obtaining some relevant tags founds in "invstOrSec". Below is a snapshot of one instance of "invstOrSec" where the text accompanying the tag "curCd" is USD.
<?xml version="1.0" encoding="UTF-8"?>
<invstOrSec>
<name>NIPPON LIFE INSURANCE</name>
<lei>549300Y0HHMFW3EVWY08</lei>
<curCd>USD</curCd>
<invstOrSec>
This is relatively straightforward, and my current approach involves first defining the relevant tags in a dictionary and then coarse them into a dataframe in a loop.
import pandas as pd
from lxml import etree
# Declare directory
os.chdir('C:/Users/A1610222/Desktop/Form NPORT/pkg/sec-edgar-filings/0001548717/NPORT-P/0001752724-
20-040624')
# Set root
xmlfile = "filing-details.xml"
tree = etree.parse(xmlfile)
root = tree.getroot()
# Remove namespace prefixes
for elem in root.getiterator():
elem.tag = etree.QName(elem).localname
# Remove unused namespace declarations
etree.cleanup_namespaces(root)
# Set path
invstOrSec = root.xpath('//invstOrSec')
# Define tags to extract
vars = {'invstOrSec' : {'name', 'lei', 'curCd'}
# Extract holdings data
sec_info = pd.DataFrame()
temp = pd.DataFrame()
for one in invstOrSec:
for two in one:
if two.tag in vars['invstOrSec']:
temp[two.tag] = [two.text]
sec_info = sec_info.append(temp)
Here are the top three rows of sec_info
name | lei | curCd |
---|---|---|
NIPPON LIFE INSURANCE | 549300Y0HHMFW3EVWY08 | USD |
Lloyds Banking Group PLC | 549300PPXHEU2JF0AM85 | USD |
Enbridge Inc | 98TPTUM4IVMFCZBCUR27 | USD |
However, the xml follows a slightly different structure when the currency is not USD. See the below example.
<?xml version="1.0" encoding="UTF-8"?>
<invstOrSec>
<name>ACHMEA BV</name>
<lei>7245007QUMI1FHIQV531</lei>
<currencyConditional curCd="EUR" exchangeRt="0.89150400"/>
<invstOrSec>
This time curCd is replaced with a different tag currencyConditional and it contains attributes as opposed to the text. I am having a hard time trying to account for these cases while keeping my code as general as possible. I hope I have managed to illustrate the problem. Again, please excuse me if this is too elementary. Any help would be much appreciated.