I have written the following code to download the text of one of Apple's financial reports from the SEC:
import requests
from bs4 import BeautifulSoup

# 'email' is a placeholder for my contact email address
headers = {'User-Agent': 'email'}
response = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000119312514383437/0001193125-14-383437.txt', headers=headers)
content = response.content.decode()  # bytes.decode() defaults to UTF-8
try:
    soup = BeautifulSoup(content, "html.parser")
    if soup is None:
        raise Exception("Failed to parse with html.parser")
except Exception as e:
    # Fall back to the lxml parser if html.parser fails
    soup = BeautifulSoup(content, "lxml")
text = soup.get_text()
print(text)
This prints the full decoded text of the financial report I downloaded. However, some of the output is not decoded properly. For example, instead of Company's the output shows Company’s. I have tried encoding and decoding the text again, but that did not work, so now I am stuck. How should I modify my code to get the desired output?
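For reference, here is a minimal sketch of the symptom in isolation, without the download. It rests on an assumption I have not confirmed for this file: that the intended character is a typographic apostrophe (U+2019) whose UTF-8 bytes end up being interpreted as Windows-1252 somewhere along the way. The variable names are only for illustration.

```python
# Assumption: the garbling is UTF-8 bytes read as Windows-1252 (cp1252).
expected = "Company\u2019s"  # "Company" + typographic apostrophe + "s"

# Encoding as UTF-8 and decoding as cp1252 reproduces the garbled output.
garbled = expected.encode("utf-8").decode("cp1252")
print(garbled)

# Reversing the same round trip restores the original on this small sample.
repaired = garbled.encode("cp1252").decode("utf-8")
print(repaired == expected)
```

On this one string the reverse round trip works, but that does not show it would work on the full report, which may contain characters that cannot be encoded as cp1252.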