Parse XML with Python lxml

Question

I am trying to parse a XML using the python library lxml, and would like the resulting output to be in a dataframe. I am relatively new to python and parsing so please bear with me as I outline the problem. The original xml that I am trying to parse is available here

I am interested in obtaining some relevant tags founds in "invstOrSec". Below is a snapshot of one instance of "invstOrSec" where the text accompanying the tag "curCd" is USD.

<?xml version="1.0" encoding="UTF-8"?>
    <invstOrSec>
        <name>NIPPON LIFE INSURANCE</name>
        <lei>549300Y0HHMFW3EVWY08</lei>
        <curCd>USD</curCd>
    <invstOrSec>

This is relatively straightforward, and my current approach involves first defining the relevant tags in a dictionary and then coarse them into a dataframe in a loop.

    import pandas as pd
    from lxml import etree

    # Declare directory
    os.chdir('C:/Users/A1610222/Desktop/Form NPORT/pkg/sec-edgar-filings/0001548717/NPORT-P/0001752724- 
    20-040624')

    # Set root
    xmlfile = "filing-details.xml"
    tree = etree.parse(xmlfile)
    root = tree.getroot()

    # Remove namespace prefixes
    for elem in root.getiterator():
        elem.tag = etree.QName(elem).localname
   
    # Remove unused namespace declarations
    etree.cleanup_namespaces(root)

    # Set path
    invstOrSec = root.xpath('//invstOrSec')

    # Define tags to extract
    vars = {'invstOrSec' : {'name', 'lei', 'curCd'}

    # Extract holdings data
    sec_info =  pd.DataFrame()
    temp = pd.DataFrame()

    for one in invstOrSec:
        for two in one:
            if two.tag in vars['invstOrSec']:
                temp[two.tag] = [two.text]
        sec_info = sec_info.append(temp)

Here are the top three rows of sec_info

name	lei	curCd
NIPPON LIFE INSURANCE	549300Y0HHMFW3EVWY08	USD
Lloyds Banking Group PLC	549300PPXHEU2JF0AM85	USD
Enbridge Inc	98TPTUM4IVMFCZBCUR27	USD

However, the xml follows a slightly different structure when the currency is not USD. See the below example.

<?xml version="1.0" encoding="UTF-8"?>
    <invstOrSec>
        <name>ACHMEA BV</name>
        <lei>7245007QUMI1FHIQV531</lei>
        <currencyConditional curCd="EUR" exchangeRt="0.89150400"/>
    <invstOrSec>

This time curCd is replaced with a different tag currencyConditional and it contains attributes as opposed to the text. I am having a hard time trying to account for these cases while keeping my code as general as possible. I hope I have managed to illustrate the problem. Again, please excuse me if this is too elementary. Any help would be much appreciated.

score 0 · Accepted Answer · answered Mar 09 '21 at 03:01

0

This is one case where you shouldn't try to reinvent the wheel; use tools developed by others...

import pandas as pd
import pandas_read_xml as pdx

url = 'https://www.sec.gov/Archives/edgar/data/1548717/000175272420040624/primary_doc.xml'

df = pdx.read_xml(url,['edgarSubmission', 'formData', 'invstOrSecs','invstOrSec'])

#because of the non-US currency column, you have to apply one more contortion:
df['currencyConditional'] = df['currencyConditional'].apply(lambda x: x.get('@curCd') if not isinstance(x,float) else "NA" )
df[['name','lei','curCd','currencyConditional']]

Output (partial, obviously) - note the last row:

168     BNP PARIBAS     R0MUWSFPU8MPRO8K5P83    USD     NA
169     Societe Generale    O2RNE8IBXP4R0TD8PU41    USD     NA
170     BARCLAYS PLC    213800LBQA1Y9L22JB70    NaN     GBP

answered Mar 09 '21 at 03:01

Jack Fleeting

24,385
6
23
45

Thank you very much Jack for taking time out to answer my question. Your solution worked like a charm. I was not aware of the pandas_read_xml library. This is indeed a life saver. I just have a quick clarifying question. What is the point of the "if not isinstance(x,float)" in your proposed code. I understand what the syntax does, but I can't seem to follow why we would need it in this context. Again, thanks for your time. – stump Mar 11 '21 at 21:22
1

@stump The column in question contains many `NaN`s. For some reason (I still don't get why), pandas treats them as floats. If you remove the `isinstance()` condition from the list comprehension, an error will be raised. So having the condition is necessary in order to skip them. – Jack Fleeting Mar 11 '21 at 21:40
That is exactly what I wondered about. Weird that pandas treats NaNs as floats. Thank you! – stump Mar 11 '21 at 21:47
I have come far with your help, however I am stuck once again. I am trying to parse the below xml using your proposed procedure "https://www.sec.gov/Archives/edgar/data/1174610/000175272421013053/primary_doc.xml". As earlier, I am interested in accessing the data under the tag "InvstOrSec". However, this time the resulting data frame outputs seven duplicate rows as opposed to one unique row since there is only one occurrence of "invstOrSec" . I have successfully parsed over 1500 xml files, and this is only one being weird. Do you happen to have any clue? Sorry for bugging you once again. – stump Mar 19 '21 at 02:13
1

@stump It's interesting you haven't run into this problem before - the previous filer was investing in equity/debt securities; this one is investing in futures (WTI Crude, to be specific) and the underlying concepts for these are somewhat different. To actually go into this is too long for a comment, so you best post it as a separate question. – Jack Fleeting Mar 19 '21 at 23:02
Thanks for getting back. I posted it as a separate question and got the response from the package owner. Here is the link to my post in case you are interested. https://stackoverflow.com/q/66710039/15357224 – stump Mar 21 '21 at 20:41

Parse XML with Python lxml

1 Answers1