Why is some of the text I extracted not properly decoded in Python?

Question

I have written the following code to download the text of one of Apple's financial reports from the SEC website:

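import requests
from bs4 import BeautifulSoup

# 'email' is a placeholder for the contact address sent as the User-Agent
headers = {'User-Agent': 'email'}
response = requests.get('https://www.sec.gov/Archives/edgar/data/320193/000119312514383437/0001193125-14-383437.txt', headers=headers)
# bytes.decode() with no argument decodes the raw bytes as UTF-8
content = response.content.decode()
try:
    soup = BeautifulSoup(content, "html.parser")
    if soup is None:
        raise Exception("Failed to parse with html.parser")
except Exception:
    # Fall back to the lxml parser if html.parser fails
    soup = BeautifulSoup(content, "lxml")
# Strip the markup and keep only the text
text = soup.get_text()
print(text)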

This prints the full text of the financial report I downloaded. However, some characters are not decoded properly. For example, instead of Company's the output shows Company’s. I have tried encoding and decoding the text again, but that does not help, and I am stuck. How should I modify my code to get the desired output?

show us an example that doesn't work as you expect... e.g. `soup = BeautifulSoup("company's", 'lxml')` should work fine ... — Joran Beasley, May 03 '23 at 23:27
`decode()` is for decoding binary character encoding schemes, it doesn't translate between different characters. It won't turn apostrophe to single quote. — Barmar, May 03 '23 at 23:31
@JoranBeasley He did, although it wasn't easy to see in his question formatting. — Barmar, May 03 '23 at 23:32
FYI, I found the linked question by googling "python convert apostrophe to ascii" — Barmar, May 03 '23 at 23:34
@Barmar the linked question is literally not answered. The answer just removes the character completely. That is not what I want, but I am sorry you thought I did no research at all. And just a suggestion to you, you should try to use some kindness. Also, the output does not show Company’s but just an error. — larzz_010, May 03 '23 at 23:57
@larzz_010 The answer using the Unidecode library seems to do exactly what you want. `Gavin O’Connor` => `Gavin O'Connor` — Barmar, May 03 '23 at 23:58
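
To make the distinction in the comments concrete, here is a minimal standard-library sketch. It assumes the character involved is U+2019 (right single quotation mark) and that the `â€™` sequence comes from UTF-8 bytes being read as Windows-1252 somewhere along the way; the thread does not confirm either point. The sample string, the `PUNCT_TO_ASCII` table, and the `to_ascii_punctuation` helper are hypothetical names for illustration, and the table is a small hand-written stand-in for what the Unidecode library does far more broadly.

```python
# Hypothetical sample: a typographic apostrophe (U+2019), as assumed above.
original = "Company\u2019s"
utf8_bytes = original.encode("utf-8")  # the apostrophe becomes three bytes

# Reading those UTF-8 bytes with the wrong codec produces the three-character
# sequence quoted in the question.
mojibake = utf8_bytes.decode("cp1252")

# Undoing the wrong step (encode back to cp1252, decode as UTF-8) recovers
# the original character. This only works if the assumption above holds.
repaired = mojibake.encode("cp1252").decode("utf-8")

# decode() never swaps one character for another, so turning a typographic
# quote into an ASCII one is a separate, explicit mapping step.
# Hypothetical, deliberately small table; extend it as needed.
PUNCT_TO_ASCII = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def to_ascii_punctuation(text: str) -> str:
    """Replace common typographic quotes with their ASCII equivalents."""
    return text.translate(PUNCT_TO_ASCII)


ascii_text = to_ascii_punctuation(repaired)
print(ascii_text)
```

If the assumption does not hold for the real filing, the `encode("cp1252")` step may raise `UnicodeEncodeError`, which is a reasonable signal that the text was not mis-decoded in this particular way.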
