
Currently, I'm trying to scrape 10-K submission text files on sec.gov.

Here's an example text file:
https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt

The text document contains things like HTML tags, CSS styles, and JavaScript. Ideally, I'd like to scrape only the content after removing all the tags and styling.

First, I tried the obvious get_text() method from BeautifulSoup. That didn't work out.
Then I tried using a regex to remove everything between < and >. Unfortunately, that didn't fully work either: some tags, styles, and scripts are left behind.

Does anyone have a clean solution for me to accomplish my goal?

Here is my code so far:

import requests
import re

url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt'
response = requests.get(url)
text = re.sub('<.*?>', '', response.text)
print(text)
JonathanDavidArndt
teller.py3

1 Answer


Let's set a dummy string based on the example:

original_content = """
<script>console.log("test");</script>
<TD VALIGN="bottom" ALIGN="center"><FONT STYLE="font-family:Arial; ">(Address of principal executive offices)</FONT></TD>
"""

Now let's remove all the JavaScript.

# Note: on recent lxml releases (5.2 and later, as far as I know) this module
# lives in the separate lxml_html_clean package, which has to be installed.
from lxml.html.clean import Cleaner

# Remove <script> elements (the other options are left in for the sake of example).
cleaner = Cleaner(
    comments=True,  # True = remove comments
    meta=True,      # True = remove meta tags
    scripts=True,   # True = remove script tags
    embedded=True,  # True = remove embedded content
)
clean_dom = cleaner.clean_html(original_content)

(From https://stackoverflow.com/a/46371211/1204332)

And then we can either remove the HTML tags (extract the text) with the HTMLParser library:

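# Python 3 module path; on Python 2 this was `from HTMLParser import HTMLParser`.
from html.parser import HTMLParser

# Strip HTML.

class MLStripper(HTMLParser):
    def __init__(self):
        super().__init__()  # sets up the parser state (this also calls reset())
        self.fed = []
    def handle_data(self, d):
        # Called for every run of text between tags.
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    s.feed(html)
    s.close()  # flush any text still buffered at the end of the input
    return s.get_data()

text_content = strip_tags(clean_dom)

print(text_content)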

(From: https://stackoverflow.com/a/925630/1204332)
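If you would rather not depend on lxml for the cleaning step, the same HTMLParser idea can probably be stretched to skip script and style bodies by itself. The following is only a minimal sketch under that assumption: `TextExtractor` and `extract_text` are names made up for this example, and it has been tried on the dummy string only, not on a full filing.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text nodes, ignoring anything inside <script> or <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lowercase.
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside <script> or <style>.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def get_text(self):
        return "".join(self._chunks)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered at the end of the input
    return parser.get_text()


sample = """
<script>console.log("test");</script>
<TD VALIGN="bottom" ALIGN="center"><FONT STYLE="font-family:Arial; ">(Address of principal executive offices)</FONT></TD>
"""
print(extract_text(sample).strip())
```

On the dummy string this prints only the address line; the script body is dropped because `handle_data` ignores everything seen while the skip counter is non-zero.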

Or we could get the text with the lxml library:

from lxml.html import fromstring

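# Parse the cleaned markup rather than the raw string; as far as I can tell,
# text_content() would otherwise also return the text inside <script>.
print(fromstring(clean_dom).text_content())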
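Whichever route you pick, the extracted text will likely still carry layout leftovers such as blank lines and non-breaking spaces, since the tags go away but the whitespace between them does not. A small post-processing step can tidy that up. This is a sketch only: `tidy_text` is a hypothetical helper, and the `raw` string is an invented stand-in for extractor output, not taken from the filing.

```python
import re


def tidy_text(text):
    """Normalize whitespace in extracted text (hypothetical helper)."""
    # Non-breaking spaces (what &nbsp; decodes to) become ordinary spaces.
    text = text.replace("\u00a0", " ")
    # Collapse runs of spaces and tabs, then trim each line.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    # Drop empty lines so only lines with content remain.
    return "\n".join(line for line in lines if line)


# Invented sample that mimics what an extractor might return.
raw = "\n\n  Form\u00a010-K \t\n\n\n(Address of   principal executive offices)\n"
print(tidy_text(raw))
```

Dropping every empty line also removes paragraph breaks, so if those matter to you, keep a single blank line between blocks instead.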
Ivan Chaer
  • The fact that we are using a class here is just an implementation detail for this library (HTMLParser). You can see the documentation here: https://docs.python.org/2/library/htmlparser.html . As you can see in their page, that's how they do it. Classes are handy, have a look when you have the time. :) Good coding, and welcome to Stack Overflow! – Ivan Chaer Sep 05 '18 at 17:29
  • I guess the difference lies in the parsers and methods used. While `lxml` is a binding for the C libraries `libxml2` and `libxslt`, the `HTMLParser` library is a Python based solution, much simpler. For the sake of completeness, I added the `lxml` option to the answer. If all you need is to clean the HTML tags, you could perhaps get away just with HTMLParser. In my experience, `lxml` was often the go-to tool. But I still use `HTMLParser` for removing HTML tags, as it gets the job done fine. – Ivan Chaer Sep 05 '18 at 21:21