
Apologies in advance for the long question; I am new to Python and I'm trying to be as explicit as I can about a fairly specific situation.

I am trying to identify specific data points from SEC filings on a routine basis, but I want to automate this instead of manually searching for a company's CIK ID and form filing. So far, I have been able to get to the point where I am downloading metadata about all filings received by the SEC in a given time period. It looks like this:

   index  cik      conm                    type  date        path
0  0      1000045  NICHOLAS FINANCIAL INC  10-Q  2019-02-14  edgar/data/1000045/0001193125-19-039489.txt
1  1      1000045  NICHOLAS FINANCIAL INC  4     2019-01-15  edgar/data/1000045/0001357521-19-000001.txt
2  2      1000045  NICHOLAS FINANCIAL INC  4     2019-02-19  edgar/data/1000045/0001357521-19-000002.txt
3  3      1000045  NICHOLAS FINANCIAL INC  4     2019-03-15  edgar/data/1000045/0001357521-19-000003.txt
4  4      1000045  NICHOLAS FINANCIAL INC  8-K   2019-02-01  edgar/data/1000045/0001193125-19-024617.txt
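
For reference, a table like this can be built from EDGAR's quarterly master index files. Below is a minimal sketch, assuming the standard pipe-delimited master.idx layout; the column names I assign and the User-Agent string are my own placeholders, and the preamble-skipping logic is an assumption worth checking against the actual file:

import io
import requests
import pandas as pd

# EDGAR publishes a pipe-delimited index of all filings for each quarter.
idx_url = "https://www.sec.gov/Archives/edgar/full-index/2019/QTR1/master.idx"

# SEC's fair-access policy asks automated clients to identify themselves.
resp = requests.get(idx_url, headers={"User-Agent": "your-name your@email.com"})
resp.raise_for_status()

# Data rows start after the "CIK|Company Name|..." header and its dashed underline:
# 1000045|NICHOLAS FINANCIAL INC|10-Q|2019-02-14|edgar/data/1000045/0001193125-19-039489.txt
lines = resp.text.splitlines()
start = next(i for i, line in enumerate(lines) if line.startswith("CIK|")) + 2
sec = pd.read_csv(
    io.StringIO("\n".join(lines[start:])),
    sep="|",
    names=["cik", "conm", "type", "date", "path"],
)
print(sec.head())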

Despite having all this information, and being able to download these text files and see the underlying data, I am unable to parse them because they are in XBRL format, which is a bit out of my wheelhouse. Instead, I came across this script (kindly provided in this CodeProject article: https://www.codeproject.com/Articles/1227765/Parsing-XBRL-with-Python):

from bs4 import BeautifulSoup
import requests
import sys

# Company and filing to search for
cik = '0000051143'
form_type = '10-K'  # named form_type to avoid shadowing the built-in type()
dateb = '20160101'

# Obtain HTML for search page
base_url = "https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={}&type={}&dateb={}"
edgar_resp = requests.get(base_url.format(cik, form_type, dateb))
edgar_str = edgar_resp.text

# Find the document link
doc_link = ''
soup = BeautifulSoup(edgar_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile2')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        # Keep the filing whose filing date (fourth column) falls in 2015
        if '2015' in cells[3].text:
            doc_link = 'https://www.sec.gov' + cells[1].a['href']

# Exit if document link couldn't be found
if doc_link == '':
    print("Couldn't find the document link")
    sys.exit()

# Obtain HTML for document page
doc_resp = requests.get(doc_link)
doc_str = doc_resp.text

# Find the XBRL link
xbrl_link = ''
soup = BeautifulSoup(doc_str, 'html.parser')
table_tag = soup.find('table', class_='tableFile', summary='Data Files')
rows = table_tag.find_all('tr')
for row in rows:
    cells = row.find_all('td')
    if len(cells) > 3:
        # 'INS' in the Type column marks the XBRL instance document
        if 'INS' in cells[3].text:
            xbrl_link = 'https://www.sec.gov' + cells[2].a['href']

# Obtain XBRL text from document
xbrl_resp = requests.get(xbrl_link)
xbrl_str = xbrl_resp.text

# Find and print stockholder's equity
soup = BeautifulSoup(xbrl_str, 'lxml')
tag_list = soup.find_all()
for tag in tag_list:
    if tag.name == 'us-gaap:stockholdersequity':
        print("Stockholder's equity: " + tag.text)    

Just running this script works exactly how I'd like it to. It returns the stockholders' equity for a given company (IBM in this case), and I can then take that value and write it to an Excel file.
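
For the Excel step, a one-liner with pandas does it; this is just a sketch with made-up file and column names (to_excel needs openpyxl installed for .xlsx output):

import pandas as pd

# Rows collected from the scraper; the values here are purely illustrative.
results = [('0000051143', '10-K', '20160101', '1234500000')]
df = pd.DataFrame(results, columns=['cik', 'type', 'date', 'stockholders_equity'])
df.to_excel('sec_values.xlsx', index=False)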

My two-part question is this:

  1. I took the three relevant columns (CIK, type, and date) from my original metadata table above and wrote them to a list of tuples (I think that's what it's called); it looks like this: [('1009759', 'D', '20190215'), ('1009891', 'D', '20190206'), ...]. How do I take this data, replace the hard-coded values at the start of the script I found, and loop through it efficiently so I end up with a list of desired values for each company, filing, and date?
  2. Is there generally a better way to do this? I would think there would be some sort of API or Python package for querying the data I'm interested in. I know there is some high-level information out there for Form 10-Ks and Form 10-Qs, but I am working with Form Ds, which are somewhat obscure. I just want to make sure I am spending my time effectively on the best possible solution.

Thank you for the help!

  • I'm a little confused: the CIK in your code is in a different format from the metadata table and the first part of the question. Actual EDGAR CIKs are 10 digits, including leading zeros. Is that a typo? – Jack Fleeting Apr 11 '19 at 11:01
  • It looks like both in the table up top and in my question #1 I am dropping the leading zeros. Searching Nicholas Financial Inc returns the CIK I grabbed without the zeros, and Capstone Turbine Corp looks like it aligns with the first CIK in my Q1. Hope that helps. – bvd Apr 11 '19 at 12:27
  • And did you generate the metadata in the top table? – Jack Fleeting Apr 11 '19 at 13:03
  • Yes I did. It was a bit easier (and there's a bit more documentation) to get the high-level information on what was filed; the underlying data within each of these filings was what was tricky. I think the data just dropped the leading zeroes because of how I wrote it to the DataFrame, though. – bvd Apr 11 '19 at 15:47

2 Answers


You need to define a function, which can be essentially most of the code you have posted; that function should take three arguments (your three values). Then, rather than hard-coding the three values in your script, you pass them in and return a result.

Then you take the list you created and write a simple for loop around it to call the function you defined with those three values, and do something with the result.

def get_data(cik, form_type, dateb):
    # your main code here, but using the three arguments
    # in place of the hard-coded values
    return content

for cik, form_type, dateb in companies:
    content = get_data(cik, form_type, dateb)
    # do something with content
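
For instance, here is a fuller sketch of that pattern, folding the question's script into one function. Note that get_stockholders_equity, HEADERS, and the extra year argument are placeholders I'm introducing, not an established API; SEC also asks automated clients to send an identifying User-Agent, so I include one:

import requests
from bs4 import BeautifulSoup

# Placeholder identity string; SEC's fair-access policy asks clients to send one.
HEADERS = {'User-Agent': 'your-name your@email.com'}

def get_stockholders_equity(cik, form_type, dateb, year):
    """Scrape EDGAR as in the question's script; return the value or None."""
    search_url = ('https://www.sec.gov/cgi-bin/browse-edgar'
                  '?action=getcompany&CIK={}&type={}&dateb={}').format(cik, form_type, dateb)
    soup = BeautifulSoup(requests.get(search_url, headers=HEADERS).text, 'html.parser')

    # Locate the filing-index page whose filing date matches the requested year
    doc_link = None
    table = soup.find('table', class_='tableFile2')
    for row in table.find_all('tr') if table else []:
        cells = row.find_all('td')
        if len(cells) > 3 and year in cells[3].text:
            doc_link = 'https://www.sec.gov' + cells[1].a['href']
    if doc_link is None:
        return None

    # Locate the XBRL instance document ('INS') in the Data Files table
    soup = BeautifulSoup(requests.get(doc_link, headers=HEADERS).text, 'html.parser')
    xbrl_link = None
    table = soup.find('table', class_='tableFile', summary='Data Files')
    for row in table.find_all('tr') if table else []:
        cells = row.find_all('td')
        if len(cells) > 3 and 'INS' in cells[3].text:
            xbrl_link = 'https://www.sec.gov' + cells[2].a['href']
    if xbrl_link is None:
        return None

    # Pull the tag value out of the instance document
    soup = BeautifulSoup(requests.get(xbrl_link, headers=HEADERS).text, 'lxml')
    tag = soup.find('us-gaap:stockholdersequity')
    return tag.text if tag else None

# One extra element per tuple: the filing year to match in the search results
filings = [('0000051143', '10-K', '20160101', '2015')]
results = [(c, t, d, get_stockholders_equity(c, t, d, y)) for c, t, d, y in filings]
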
MyNameIsCaleb

Assuming you have a dataframe sec, with correctly named columns, for your list of filings above, you first need to extract the relevant information from the dataframe into three lists:

cik = list(sec['cik'].values)
dat = list(sec['date'].values)
typ = list(sec['type'].values)

Then you create your base_url with the items inserted and get your data:

for c, t, d in zip(cik, typ, dat):
    base_url = f"https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany&CIK={c}&type={t}&dateb={d}"
    edgar_resp = requests.get(base_url)

And go from there.
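
If you'd rather skip the intermediate lists, you can iterate over the dataframe directly. A small sketch of that variation (itertuples is standard pandas; one assumption worth checking is that EDGAR's dateb parameter expects YYYYMMDD, while the dates in your table contain dashes):

for row in sec.itertuples(index=False):
    # dateb expects YYYYMMDD, so strip the dashes from the table's dates
    dateb = row.date.replace('-', '')
    url = (f"https://www.sec.gov/cgi-bin/browse-edgar"
           f"?action=getcompany&CIK={row.cik}&type={row.type}&dateb={dateb}")
    edgar_resp = requests.get(url)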

Jack Fleeting