Parse SEC EDGAR XML Form Data with child nodes using BeautifulSoup

Question

I am attempting to scrape individual fund holdings from the SEC's N-PORT-P/A form using beautiful soup and xml. A typical submission, outlined below and [linked here][1], looks like:

<edgarSubmission xmlns="http://www.sec.gov/edgar/nport" xmlns:com="http://www.sec.gov/edgar/common" xmlns:ncom="http://www.sec.gov/edgar/nportcommon" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<headerData>
<submissionType>NPORT-P/A</submissionType>
<isConfidential>false</isConfidential>
<accessionNumber>0001145549-23-004025</accessionNumber>
<filerInfo>
<filer>
<issuerCredentials>
<cik>0001618627</cik>
<ccc>XXXXXXXX</ccc>
</issuerCredentials>
</filer>
<seriesClassInfo>
<seriesId>S000048029</seriesId>
<classId>C000151492</classId>
</seriesClassInfo>
</filerInfo>
</headerData>
    <formData>
        <genInfo>
        ...
        </genInfo>
        <fundInfo>
        ...
        </fundInfo>
        <invstOrSecs>
            <invstOrSec>
                <name>ARROW BIDCO LLC</name>
                <lei>549300YHZN08M0H3O128</lei>
                <title>Arrow Bidco LLC</title>
                <cusip>042728AA3</cusip>
                <identifiers>
                    <isin value="US042728AA35"/>
                </identifiers>
                <balance>115000.000000000000</balance>
                <units>PA</units>
                <curCd>USD</curCd>
                <valUSD>114754.170000000000</valUSD>
                <pctVal>0.3967552449</pctVal>
                <payoffProfile>Long</payoffProfile>
                <assetCat>DBT</assetCat>
                <issuerCat>CORP</issuerCat>
                <invCountry>US</invCountry>
                <isRestrictedSec>N</isRestrictedSec>
                <fairValLevel>2</fairValLevel>
                <debtSec>
                    <maturityDt>2024-03-15</maturityDt>
                    <couponKind>Fixed</couponKind>
                    <annualizedRt>9.500000000000</annualizedRt>
                    <isDefault>N</isDefault>
                    <areIntrstPmntsInArrs>N</areIntrstPmntsInArrs>
                    <isPaidKind>N</isPaidKind>
                </debtSec>
                <securityLending>
                    <isCashCollateral>N</isCashCollateral>
                    <isNonCashCollateral>N</isNonCashCollateral>
                    <isLoanByFund>N</isLoanByFund>
                </securityLending>
            </invstOrSec>

With Arrow Bidco LLC being a bond within the portfolio, with some of its characteristics included within the filing (CUSIP, CIK, balance, maturity date, etc.). I am looking for the best way to iterate through each individual security (investOrSec) and collect the characteristics of each security in a dataframe. The code I am currently using is:

import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup

header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36", "X-Requested-With": "XMLHttpRequest"}

n_port_file = requests.get("https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml", headers=header, verify=False)
n_port_file_xml = n_port_file.content
soup = BeautifulSoup(n_port_file_xml,'xml')

names = soup.find_all('name')
lei = soup.find_all('lei')
title = soup.find_all('title')
cusip = soup.find_all('cusip')
....
maturityDt = soup.find_all('maturityDt')
couponKind = soup.find_all('couponKind')
annualizedRt = soup.find_all('annualizedRt')

Then iterating through each list to create a dataframe based on the values in each row.

fixed_income_data = []
for i in range(0,len(names)):
    rows = [names[i].get_text(),lei[i].get_text(),
        title[i].get_text(),cusip[i].get_text(),
        balance[i].get_text(),units[i].get_text(),
        pctVal[i].get_text(),payoffProfile[i].get_text(),
        assetCat[i].get_text(),issuerCat[i].get_text(),
        invCountry[i].get_text(),couponKind[i].get_text()
        ]
    fixed_income_data.append(rows)

fixed_income_df = pd.DataFrame(equity_data,columns = ['name',
                         'lei',
                         'title',
                         'cusip',
                         'balance',
                         'units',
                         'pctVal',
                         'payoffProfile',
                         'assetCat',
                         'issuerCat',
                         'invCountry'
                         'maturityDt',
                         'couponKind',
                         'annualizedRt'
                         ], dtype = float)

This works fine when all pieces of information are included, but often there is one variable that is not accounted for. A piece of the form might be blank, or an issuer category might not have been filled out incorrectly, leading to an IndexError. This portfolio has 127 securities that I was able to parse, but might be missing an annualized return for a single security, throwing off the ability to neatly create a dataframe.

Additionally, for portfolios that hold both fixed income and equity securities, the equity securities do not return information for the debtSecs child. Is there a way to iterate through this data while simultaneously cleaning it in the easiest way possible? Even adding "NaN" for the debtSec children that equity securities don't reference would be a valid response. Any help would be much appreciated! [1]: https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml

Unless I missed something, the EDGAR API only supports financials data (10-Q, 10-K, 8-K, etc), as opposed to portfolio holdings (13-F, N-PORT, etc). From their website "Currently included in the APIs are the submissions history by filer and the XBRL data from financial statements (forms 10-Q, 10-K,8-K, 20-F, 40-F, 6-K, and their variants)." — therdawg, Feb 06 '23 at 20:00

score 1 · Accepted Answer · answered Feb 06 '23 at 21:11

Here is the best way, in my opinion, to handle the problem. Generally speaking, EDGAR filings are notoriously difficult to parse, so the following may or may not work on other filings, even from the same filer.

To make it easier on yourself, since this is an XML file, you should use an xml parser and xpath. Given that you're looking to create a dataframe, the most appropriate tool would be the pandas read_xml() method.

Because the XML is nested, you will need to create two different dataframes and concatenate them (maybe others will have a better idea on how to approach it). And, finally, although read_xml() can read directly from a url, in this case, EDGAR requires using a user-agent, meaning you also need to use the requests library as well.

So, all together:

#import required libraries
import pandas as pd
import requests

url = 'https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml'
#set headers with a user-agent
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}    
req =  requests.get(url, headers=headers)

#define the columns you want to drop (based on the data in your question)
to_drop = ['identifiers', 'curCd','valUSD','isRestrictedSec','fairValLevel','debtSec','securityLending']

#the filing uses namespaces (too complicated to get into here), so you need to define that as well
namespaces = {"nport": "http://www.sec.gov/edgar/nport"}

#create the first df, for the securities which are debt instruments
invest = pd.read_xml(req.text,xpath="//nport:invstOrSec[.//nport:debtSec]",namespaces=namespaces).drop(to_drop, axis=1)

#crete the 2nd df, for the debt details:
debt = pd.read_xml(req.text,xpath="//nport:debtSec",namespaces=namespaces).iloc[:,0:3]

#finally, concatenate the two into one df:
pd.concat([invest, debt], axis=1)

This should output your 126 debt securities (pardon the formatting):

lei     title   cusip   balance     units   pctVal  payoffProfile   assetCat    issuerCat   invCountry  maturityDt  couponKind  annualizedRt
0   ARROW BIDCO LLC     549300YHZN08M0H3O128    Arrow Bidco LLC     042728AA3   115000.00   PA  0.396755    Long    DBT     CORP    US  2024-03-15  Fixed   9.50000
1   CD&R SMOKEY BUYER INC   NaN     CD&R Smokey Buyer Inc   12510CAA9   165000.00   PA  0.505585    Long    DBT     CORP    US  2025-07-15  Fixed   6.75000

You can then play with the final df, add or drop columns, etc

That's beautiful--thank you! Yes, I have come to the conclusion that EDGAR is generally not agreeable to work with... the 13-F HMTL isn't terrible, but the N-PORT filing is poorly constructed. Marking this correct, as initially it looks like exactly what I was asking for! Will take some time to play around with it and try to construct a similar DF for an equities only portfolio using the methods you outlined above. Much appreciated once again! — therdawg, Feb 06 '23 at 21:32
@therdawg Glad it worked for you! Not to depress you too much, but wait until you have to start tinkering with XBRL files :) — Jack Fleeting, Feb 06 '23 at 21:33
That will be the next step--I screwed around with XBRL previously and I am aware how tough it is as well, but that is a bridge I will cross when I get to it. If I can ask one follow up question, how would I go about pulling the identifiers? Ideally, that is something that I would have, but the dataframe would fill NaN if the form does not include the identifier. I assume that would require a third dataframe using the namespaces you outlined in your original answer? Apologies for the lack of clarity on my end. — therdawg, Feb 06 '23 at 21:57
@therdawg That may be more complicated because some `` have an `` child but others don't. It may get too complicated for read_xml() and you will then have to resort to other tools. In any case, if it's important, you should post it as a separate question, per SO policy. — Jack Fleeting, Feb 06 '23 at 23:22

Parse SEC EDGAR XML Form Data with child nodes using BeautifulSoup

1 Answers1