Using Python 3, BeautifulSoup, and very minimal regex, I'm trying to scrape the text from this webpage:
http://www.presidency.ucsb.edu/ws/?pid=2921
I have already successfully extracted its html into a file. In fact, I've done this with almost all of the presidential speeches available on this website; I have the html for 247 (out of 258 possible) speeches saved locally on my computer.
My code for extracting just the text off of each page looks like this:
import re
from bs4 import BeautifulSoup

# 'scan_here.txt' is a file containing all the pages whose html I have downloaded successfully
with open('scan_here.txt') as reference:
    for line_unclean in reference:
        # each file's name is just a random string of 5-6 digits
        line = re.sub(r'\n', '', line_unclean)  # remove '\n' from each file name
        f = open('local_path_to_folder_containing_all_the_html_files\\' + line)
        doc = f.read()
        soup = BeautifulSoup(doc, 'html.parser')
        for speech in soup.select('span.display-text'):
            final_speech = str(speech)
            print(final_speech)
Running this code, I get the following error message:
Traceback (most recent call last):
  File "extract_individual_speeches.py", line 11, in <module>
    doc = f.read()
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 56443: invalid start byte
I understand this is a decode error, and I have tried running this code on other html files, not just the first one listed in 'scan_here.txt'. I get the same error each time, so I think it's an encoding issue with the html files themselves.
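To check my understanding of the error, here is a minimal sketch that reproduces it without any of my files. It assumes the offending byte is 0x97 (as in the traceback) and that Python's built-in 'cp1251' codec corresponds to the windows-1251 charset; the sample bytes and the try_decode helper are made up for illustration.

```python
# Hypothetical sample: ASCII text containing the single byte 0x97 from the traceback.
raw = b'peace \x97 and prosperity'

def try_decode(data, encoding):
    """Return the decoded text, or None if the bytes are not valid in that encoding."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None

# 0x97 is 0b10010111, a UTF-8 continuation byte, so it cannot start a character.
utf8_result = try_decode(raw, 'utf-8')

# In cp1251, 0x97 maps to U+2014 (em dash), so the same bytes decode cleanly.
cp1251_result = try_decode(raw, 'cp1251')

# ascii() keeps the output printable on any console.
print(ascii(utf8_result))
print(ascii(cp1251_result))
```

If this is right, the files are not broken; they are just not UTF-8, which is what open() seems to be assuming by default on my machine.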
I think the problem might lie with the third line of the html, which declares the same encoding in all my html files:
<html>
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">
What is 'windows-1251'? I assume it's the problem here. I've looked it up and seen that there are some windows-1251 to UTF-8 converters, but I didn't find one that works well with Python.
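One thing I could try, as a rough sketch rather than a confirmed fix: tell open() which encoding to use instead of relying on the default. This assumes the charset declared in the META tag is accurate and that Python's 'cp1251' codec is the right match for it; read_html and the sample file below are hypothetical names used only for illustration.

```python
import os
import tempfile

def read_html(path, encoding='cp1251'):
    """Read an html file as text using an explicit encoding (assumed from the META tag)."""
    with open(path, encoding=encoding) as f:
        return f.read()

# Hypothetical sample file containing the byte 0x97, which is invalid as UTF-8.
sample_bytes = b'<span class="display-text">peace \x97 and prosperity</span>'

with tempfile.TemporaryDirectory() as folder:
    sample_path = os.path.join(folder, '12345')  # made-up file name
    with open(sample_path, 'wb') as out:
        out.write(sample_bytes)
    doc = read_html(sample_path)

# ascii() keeps the output printable on any console.
print(ascii(doc))
```

If that reads the files without raising, doc could presumably be passed to BeautifulSoup(doc, 'html.parser') exactly as in my code above.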
I found this SO thread which seems to deal with this issue of conversion, but I'm not sure how to integrate it with my existing code.
Any help on this issue is much appreciated, TIA.