
I want to automatically extract the section "1A. Risk Factors" from around 10,000 files and write it to txt files. A sample URL with a file can be found here

The desired section is between "Item 1a Risk Factors" and "Item 1b". The problem is that 'item', '1a' and '1b' may look different across these files and may appear in multiple places, not only in the longest, proper section that interests me. Therefore some regular expressions should be used, so that:

  1. The longest part between "1a" and "1b" is extracted (otherwise the table of contents and other useless elements will appear)

  2. Different variants of the expressions are taken into consideration

I tried to implement these two goals in the script, but as this is my first project in Python, I just randomly ordered expressions that I think might work, and apparently they are in the wrong order (I'm sure I should iterate over the `<a>` elements, add each extracted "section" to a list, then choose the longest one and write it to a file, though I don't know how to implement this idea). EDIT: Currently my method returns very little data between 1a and 1b (I think it's a page number) from the table of contents, and then it stops.

My code:

import requests
import re
import csv

from bs4 import BeautifulSoup as bs

with open('indexes.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        # Build the output file name from the first four columns, stripping slashes.
        fn1 = line[0]
        fn2 = re.sub(r'[/\\]', '', line[1])
        fn3 = re.sub(r'[/\\]', '', line[2])
        fn4 = line[3]
        saveas = '-'.join([fn1, fn2, fn3, fn4])
        f = open(saveas + ".txt", "w+", encoding="utf-8")
        url = 'https://www.sec.gov/Archives/' + line[4].strip()
        print(url)
        response = requests.get(url)
        soup = bs(response.content, 'html.parser')
        # Start from the <a> anchors and scan the elements that follow them.
        risks = soup.find_all('a')
        regexTxt = r'item[^a-zA-Z\n]*1a.*item[^a-zA-Z\n]*1b'
        for risk in risks:
            for i in risk.findAllNext():
                sections = re.findall(regexTxt, str(i), re.IGNORECASE | re.DOTALL)
                for section in sections:
                    # Strip any remaining HTML tags before printing.
                    clean = re.compile('<.*?>')
                    # section = re.sub(r'table of contents', '', section, flags=re.IGNORECASE)
                    # section = section.strip()
                    # section = re.sub(r'\s+', '', section).strip()
                    print(re.sub(clean, '', section))

The goal is to find the longest part between "1a" and "1b" (regardless of exactly how they look) in the current URL and write it to a file.
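For illustration, here is a minimal sketch of that idea (the helper name and the exact pattern are just assumptions, and real filings will need more cleanup): collect every span that starts at an "Item 1A"-like heading and ends at the next "Item 1B"-like heading, then keep the longest one, which skips the short table-of-contents entry.

import re

# Tolerates varying punctuation/whitespace between "Item" and "1A"/"1B".
ITEM_RE = re.compile(r'item[^a-zA-Z\n]{0,5}1a.*?item[^a-zA-Z\n]{0,5}1b',
                     re.IGNORECASE | re.DOTALL)

def extract_risk_factors(text):
    """Return the longest span between an 'Item 1A' heading and the next 'Item 1B'."""
    candidates = [m.group(0) for m in ITEM_RE.finditer(text)]
    return max(candidates, key=len, default='')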

  • Hello, again! Two preliminary things: first, Risk Factors is not always between Items 1a and 1b; in many filings there is no Item 1b (Unresolved Staff Comments) and the counting goes straight to Item 2. Second, parsing html with regex is considered a bad idea; see (for one of many examples) https://stackoverflow.com/a/1732454/9448090. – Jack Fleeting Aug 01 '19 at 15:19
  • Hi! I really enjoyed your comment about html with regex, and you are right about the lack of 1b in some of the files. I would use your script from my [previous](https://stackoverflow.com/questions/57286580/unknown-encoding-of-files-in-a-resulting-beautiful-soup-txt-file) question, but for some reason it doesn't work for 70% of the URLs (e.g. [this one](https://www.sec.gov/Archives/edgar/data/1000623/0001000623-18-000044.txt)). I don't even see any difference in the form of "item 1a"/"item" compared with the properly processed files. Do you have any idea why it doesn't work? – Karolina Andruszkiewicz Aug 01 '19 at 17:11
  • Of course the script would fail in most cases; there is no rhyme or reason in the way EDGAR docs are formatted. For example, the page you linked to in your comment above doesn't even render in a browser! No idea where you got it from, but you should use this link (https://www.sec.gov/Archives/edgar/data/1000623/000100062318000044/swmform10-k12312017.htm) instead. But more generally, parsing 10,000 filings is a massive undertaking with significant cleanup work. I don't think there's a way around it. – Jack Fleeting Aug 01 '19 at 17:28

1 Answer


In the end I used a CSV file that contains a column HTMURL, which is the link to the htm-format 10-K. I got it from Kai Chen, who created this website. I wrote a simple script that writes plain txt into files. Processing them will be a simple task now.

import csv
from pathlib import Path

import requests
from bs4 import BeautifulSoup

with open('index.csv', newline='') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for line in reader:
        url = line[9]  # HTMURL column: link to the htm-format 10-K
        print(url)
        # The company name is used for the output file name; strip the
        # state-of-incorporation markers and any remaining slashes.
        name = line[1].replace("/PA/", "").replace("/DE/", "").replace('/', '')
        out_path = Path(name + line[4] + ".txt")
        if out_path.exists():
            # Skip filings that have already been downloaded.
            continue
        html_doc = requests.get(url).text
        soup = BeautifulSoup(html_doc, 'html.parser')
        with open(out_path, "w+", encoding="utf-8") as f:
            f.write(soup.get_text())
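For the processing step mentioned above, here is a rough sketch of what could be run over the saved txt files (the pattern and the output file name are only placeholders, not the final processing script). As Jack noted in the comments, some filings have no Item 1B, so the end marker also allows Item 2:

import re
from pathlib import Path

# End the section at "Item 1B" or, when Unresolved Staff Comments is absent, "Item 2".
SECTION_RE = re.compile(r'item[^a-zA-Z\n]{0,5}1a.*?item[^a-zA-Z\n]{0,5}(?:1b|2)',
                        re.IGNORECASE | re.DOTALL)

for txt_file in Path('.').glob('*.txt'):
    text = txt_file.read_text(encoding='utf-8')
    candidates = [m.group(0) for m in SECTION_RE.finditer(text)]
    if candidates:
        # The table-of-contents entry is short; the real section is the longest match.
        longest = max(candidates, key=len)
        out = txt_file.with_name(txt_file.stem + '_risk_factors.txt')
        out.write_text(longest, encoding='utf-8')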