I have written the following code to download the text of one of Apple's financial reports from the SEC:

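import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'email'}
url = 'https://www.sec.gov/Archives/edgar/data/320193/000119312514383437/0001193125-14-383437.txt'
response = requests.get(url, headers=headers)
# bytes -> str; decode() with no argument assumes UTF-8
content = response.content.decode()
try:
    soup = BeautifulSoup(content, "html.parser")
    # Note: BeautifulSoup() never returns None, so this check never fires
    if soup is None:
        raise Exception("Failed to parse with html.parser")
except Exception:
    # Fall back to lxml only if html.parser raised an exception
    soup = BeautifulSoup(content, "lxml")
text = soup.get_text()
print(text)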

This prints the full decoded text of the financial report I downloaded. However, some characters in the output are not what I expect. For example, instead of Company's the output shows Company’s. I have tried encoding and decoding the text again, but that does not help, so now I am stuck. How should I modify my code to get the desired output?

Barmar
larzz_010
  • show us an example that doesn't work as you expect... e.g. `soup = BeautifulSoup("company's", 'lxml')` should work fine ...
    – Joran Beasley May 03 '23 at 23:27
  • `decode()` is for decoding binary character encoding schemes; it doesn't translate between different characters. It won't turn apostrophe to single quote. – Barmar May 03 '23 at 23:31
  • @JoranBeasley He did, although it wasn't easy to see in his question formatting. – Barmar May 03 '23 at 23:32
  • FYI, I found the linked question by googling "python convert apostrophe to ascii" – Barmar May 03 '23 at 23:34
  • @Barmar the linked question is literally not answered. The answer just removes the character completely. That is not what I want, but I am sorry you thought I did no research at all. And just a suggestion to you, you should try to use some kindness. Also, the output does not show Company’s but just an error. – larzz_010 May 03 '23 at 23:57
  • @larzz_010 The answer using the Unidecode library seems to do exactly what you want. `Gavin O’Connor` => `Gavin O'Connor` – Barmar May 03 '23 at 23:58
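
To illustrate the distinction drawn in the comments above, here is a minimal sketch, assuming the goal is to replace curly quotation marks with their plain ASCII equivalents. The helper name `to_ascii_punctuation` and the mapping table are hypothetical and not part of the original code; this uses only the standard library rather than Unidecode, and it covers only the four characters listed.

```python
# Hypothetical mapping from typographic quotes to ASCII quotes.
# Extend it with any other characters you want to normalize.
PUNCTUATION_TO_ASCII = str.maketrans({
    "\u2018": "'",  # left single quotation mark
    "\u2019": "'",  # right single quotation mark (curly apostrophe)
    "\u201c": '"',  # left double quotation mark
    "\u201d": '"',  # right double quotation mark
})


def to_ascii_punctuation(text: str) -> str:
    """Replace curly quotes in text with straight ASCII quotes."""
    return text.translate(PUNCTUATION_TO_ASCII)


# decode() only turns bytes into str; the character itself is unchanged.
raw = "Company\u2019s".encode("utf-8")
decoded = raw.decode("utf-8")
assert decoded == "Company\u2019s"  # still the curly apostrophe

# Character translation is a separate step applied to the decoded text.
print(to_ascii_punctuation(decoded))  # Company's
```

In the question's code this would be applied to the result of `soup.get_text()` before printing. The Unidecode library mentioned above offers a `unidecode()` function that handles a much wider range of characters, if installing a third-party package is acceptable.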

0 Answers