Currently, I'm trying to scrape 10-K submission text files on sec.gov.
Here's an example text file:
https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt
The text document contains HTML tags, CSS styles, and JavaScript. Ideally, I'd like to extract only the readable content, with all tags, styles, and scripts removed.
First, I tried the obvious get_text() method from BeautifulSoup. That didn't work out.
Then I tried using a regex to remove everything between < and >. Unfortunately, this didn't fully work either: some tags, styles, and scripts remain in the output.
Does anyone have a clean way to accomplish this?
Here is my code so far:
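import re
import requests

# Download the full submission text file for the 10-K filing
url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312515356351/0001193125-15-356351.txt'
response = requests.get(url)

# Attempt to strip everything between < and > (this is the step that leaves residue)
text = re.sub(r'<.*?>', '', response.text)
print(text)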
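
To illustrate what I think is going wrong, here is a small self-contained sketch. The HTML fragment is made up for the example and is not taken from the filing, and I'm only assuming the same two effects cause the leftovers in the real document: the regex removes the tags themselves but not the CSS/JavaScript between them, and `.` does not match newlines by default, so a tag that spans several lines survives.

```python
import re

# Hypothetical HTML fragment (not from the actual filing)
sample = (
    "<style>p { color: red; }</style>\n"
    "<script>var x = 1;</script>\n"
    "<p\n class=\"a\">Revenue grew.</p>\n"
)

# Same pattern as in my code: the multi-line <p ...> tag is not matched,
# and the contents of <style> and <script> are left behind
stripped = re.sub(r'<.*?>', '', sample)
print(stripped)

# With re.DOTALL the multi-line tag is removed as well,
# but the CSS and JavaScript text still remains
stripped_dotall = re.sub(r'<.*?>', '', sample, flags=re.DOTALL)
print(stripped_dotall)
```

So even with `re.DOTALL`, a tag-only regex cannot drop the contents of style and script blocks, which is the part I'm still missing.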