
Is anyone here experienced with scraping SEC 10-K and 10-Q filings? I got stuck while trying to scrape monthly realised share repurchases from these filings. Specifically, I would like to get the following information for each month from 2004 to 2014: 1. Period; 2. Total Number of Shares Purchased; 3. Average Price Paid per Share; 4. Total Number of Shares Purchased as Part of Publicly Announced Plans or Programs; 5. Maximum Number (or Approximate Dollar Value) of Shares that May Yet Be Purchased Under the Plans or Programs. I have 90,000+ forms to parse in total, so it won't be feasible to do it manually.

This information is usually reported under "Part II, Item 5. Market for Registrant's Common Equity, Related Stockholder Matters and Issuer Purchases of Equity Securities" in 10-Ks and under "Part II, Item 2. Unregistered Sales of Equity Securities and Use of Proceeds" in 10-Qs.
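For concreteness, what I am trying to end up with is one record per filer-month, roughly like the sketch below (the field names are my own shorthand, not anything taken from the filings):

from dataclasses import dataclass
from typing import Optional

@dataclass
class RepurchaseRow:
    """One monthly row of the issuer purchases of equity securities table."""
    cik: str                                     # EDGAR filer identifier
    period: str                                  # e.g. "July 1, 2009 - July 31, 2009"
    total_shares_purchased: Optional[int]
    avg_price_paid_per_share: Optional[float]
    shares_purchased_under_plans: Optional[int]
    max_remaining_under_plans: Optional[float]   # share count or dollar value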

Here is one example of the 10-Q filings that I need to parse: https://www.sec.gov/Archives/edgar/data/12978/000104746909007169/a2193892z10-q.htm

If a firm has no share repurchases, this table can be missing from the quarterly report.

I have tried to parse the HTML files with Python's BeautifulSoup, but the results are not satisfactory, mainly because these files are not written in a consistent format.

For example, the only way I can think of to parse these forms is the following:

from bs4 import BeautifulSoup
import requests
import unicodedata
import re

url = 'https://www.sec.gov/Archives/edgar/data/12978/000104746909007169/a2193892z10-q.htm'

# Header of the repurchase table, allowing arbitrary whitespace and markup in between.
IDENTIFIER = re.compile(r'Total.*Number.*of.*Shares.*Purchased', re.IGNORECASE | re.DOTALL)

def remove_invalid_tags(soup, invalid_tags=('sup', 'br')):
    """Replace footnote markers and line breaks with spaces so they do not split words."""
    for tag_name in invalid_tags:
        for tag in soup.find_all(tag_name):
            tag.replace_with(' ')

def parse_html(url):
    """Return every <table> whose text contains the repurchase-table header."""
    r = requests.get(url)
    soup = BeautifulSoup(r.content, 'html5lib')

    rep_tables = []
    for table in soup.find_all('table'):
        remove_invalid_tags(table)
        # Normalise non-breaking spaces and other Unicode before matching.
        table_text = unicodedata.normalize('NFKD', table.get_text(' '))
        if IDENTIFIER.search(table_text):
            rep_tables.append(table)
    return rep_tables

The above code only returns a messy list of tables that may contain the repurchase information. However, 1) it is not reliable; 2) it is very slow; 3) the subsequent steps of scraping the date/month, share price, number of shares, etc. out of those tables are much more painful (a rough sketch of one direction for that step follows below). I am wondering whether there are more feasible languages/approaches/applications/databases for getting this information. Thanks a million!
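To be concrete about that third step: assuming pandas is installed and assuming the matched table keeps the usual five-column layout (which is not guaranteed), something like this might be a starting point:

import pandas as pd
from io import StringIO

def table_to_dataframe(table):
    """Turn a BeautifulSoup <table> returned by parse_html() into a DataFrame."""
    # pandas.read_html splits the rows and cells; cleaning up header rows
    # and footnote columns is still manual.
    df = pd.read_html(StringIO(str(table)))[0]
    # EDGAR HTML often uses empty spacer columns for layout; drop them,
    # along with completely empty rows.
    df = df.dropna(axis=1, how='all').dropna(axis=0, how='all')
    return df

# e.g. rep_tables = parse_html(url); df = table_to_dataframe(rep_tables[0])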

Jiayuan Chen
  • A full list of the websites I need to parse is attached. I'd very much appreciate it if you could give me some hints! Thanks! https://www.dropbox.com/s/369aviq5vkno9o3/ListURL.xlsx?dl=0 – Jiayuan Chen Jul 22 '15 at 14:24
  • Hey, have you had any luck? I'm just trying to do this with Tesla's data. – user2946746 Jan 12 '19 at 21:24

1 Answer


I'm not sure about Python, but in R there is a beautiful solution using the 'finstr' package (https://github.com/bergant/finstr). 'finstr' automatically extracts the financial statements (income statement, balance sheet, cash flow, etc.) from EDGAR using the XBRL format.

Lamothy
  • Any luck with the finstr package? I've been having issues with it when I try to get the last two Qs for Tesla. – user2946746 Jan 13 '19 at 20:09
  • I actually found a better solution without this package. The link below has all historical financial statement data going back to 2009. It has very comprehensive financial data. All you need is to download the files with any of your favorite data analysis tools (a minimal Python sketch of reading them follows these comments). https://www.sec.gov/files/dera/data/financial-statement-data-sets/ – Lamothy Jan 15 '19 at 01:36
  • Thanks, that looks useful for some analysis. I would like to get more up-to-date analysis, though. I'm going to play around to see if there are some other solutions. – user2946746 Jan 15 '19 at 16:38
  • In case you think the link above is broken, try this one: https://www.sec.gov/dera/data/financial-statement-data-sets.html – Brad Ahrens Apr 08 '20 at 22:47
  • The trouble with the data sets is that they are the raw filings and not the finalised numbers. There can be differences, so it depends on what you are using the data for. – Foothill_trudger Sep 13 '20 at 05:18