I am attempting to scrape individual fund holdings from the SEC's N-PORT-P/A form using beautiful soup and xml. A typical submission, outlined below and [linked here][1], looks like:
<edgarSubmission xmlns="http://www.sec.gov/edgar/nport" xmlns:com="http://www.sec.gov/edgar/common" xmlns:ncom="http://www.sec.gov/edgar/nportcommon" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
<headerData>
<submissionType>NPORT-P/A</submissionType>
<isConfidential>false</isConfidential>
<accessionNumber>0001145549-23-004025</accessionNumber>
<filerInfo>
<filer>
<issuerCredentials>
<cik>0001618627</cik>
<ccc>XXXXXXXX</ccc>
</issuerCredentials>
</filer>
<seriesClassInfo>
<seriesId>S000048029</seriesId>
<classId>C000151492</classId>
</seriesClassInfo>
</filerInfo>
</headerData>
<formData>
<genInfo>
...
</genInfo>
<fundInfo>
...
</fundInfo>
<invstOrSecs>
<invstOrSec>
<name>ARROW BIDCO LLC</name>
<lei>549300YHZN08M0H3O128</lei>
<title>Arrow Bidco LLC</title>
<cusip>042728AA3</cusip>
<identifiers>
<isin value="US042728AA35"/>
</identifiers>
<balance>115000.000000000000</balance>
<units>PA</units>
<curCd>USD</curCd>
<valUSD>114754.170000000000</valUSD>
<pctVal>0.3967552449</pctVal>
<payoffProfile>Long</payoffProfile>
<assetCat>DBT</assetCat>
<issuerCat>CORP</issuerCat>
<invCountry>US</invCountry>
<isRestrictedSec>N</isRestrictedSec>
<fairValLevel>2</fairValLevel>
<debtSec>
<maturityDt>2024-03-15</maturityDt>
<couponKind>Fixed</couponKind>
<annualizedRt>9.500000000000</annualizedRt>
<isDefault>N</isDefault>
<areIntrstPmntsInArrs>N</areIntrstPmntsInArrs>
<isPaidKind>N</isPaidKind>
</debtSec>
<securityLending>
<isCashCollateral>N</isCashCollateral>
<isNonCashCollateral>N</isNonCashCollateral>
<isLoanByFund>N</isLoanByFund>
</securityLending>
</invstOrSec>
With Arrow Bidco LLC being a bond within the portfolio, with some of its characteristics included within the filing (CUSIP, CIK, balance, maturity date, etc.). I am looking for the best way to iterate through each individual security (investOrSec) and collect the characteristics of each security in a dataframe. The code I am currently using is:
import numpy as np
import pandas as pd
import requests
from bs4 import BeautifulSoup
header = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.75 Safari/537.36", "X-Requested-With": "XMLHttpRequest"}
n_port_file = requests.get("https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml", headers=header, verify=False)
n_port_file_xml = n_port_file.content
soup = BeautifulSoup(n_port_file_xml,'xml')
names = soup.find_all('name')
lei = soup.find_all('lei')
title = soup.find_all('title')
cusip = soup.find_all('cusip')
....
maturityDt = soup.find_all('maturityDt')
couponKind = soup.find_all('couponKind')
annualizedRt = soup.find_all('annualizedRt')
Then iterating through each list to create a dataframe based on the values in each row.
fixed_income_data = []
for i in range(0,len(names)):
rows = [names[i].get_text(),lei[i].get_text(),
title[i].get_text(),cusip[i].get_text(),
balance[i].get_text(),units[i].get_text(),
pctVal[i].get_text(),payoffProfile[i].get_text(),
assetCat[i].get_text(),issuerCat[i].get_text(),
invCountry[i].get_text(),couponKind[i].get_text()
]
fixed_income_data.append(rows)
fixed_income_df = pd.DataFrame(equity_data,columns = ['name',
'lei',
'title',
'cusip',
'balance',
'units',
'pctVal',
'payoffProfile',
'assetCat',
'issuerCat',
'invCountry'
'maturityDt',
'couponKind',
'annualizedRt'
], dtype = float)
This works fine when all pieces of information are included, but often there is one variable that is not accounted for. A piece of the form might be blank, or an issuer category might not have been filled out incorrectly, leading to an IndexError. This portfolio has 127 securities that I was able to parse, but might be missing an annualized return for a single security, throwing off the ability to neatly create a dataframe.
Additionally, for portfolios that hold both fixed income and equity securities, the equity securities do not return information for the debtSecs child. Is there a way to iterate through this data while simultaneously cleaning it in the easiest way possible? Even adding "NaN" for the debtSec children that equity securities don't reference would be a valid response. Any help would be much appreciated! [1]: https://www.sec.gov/Archives/edgar/data/1618627/000114554923004968/primary_doc.xml