Using Python 3, BeautifulSoup, and a minimal amount of regex, I'm trying to scrape the text from this webpage:

http://www.presidency.ucsb.edu/ws/?pid=2921

I have already successfully extracted its HTML into a file. In fact, I've done this with almost all of the presidential speeches available on this website; I have the HTML of 247 (out of a possible 258) speeches saved locally on my computer.

My code for extracting just the text off of each page looks like this:

import re
from bs4 import BeautifulSoup

# 'scan_here.txt' is a file containing the names of all the pages whose html I have downloaded successfully
with open('scan_here.txt') as reference:
    # iterate over the lines directly; also calling reference.readline() here would skip every other file name
    for line_unclean in reference:  # each file's name is just a random string of 5-6 integers
        line = re.sub(r'\n', '', line_unclean)  # for removing '\n' from each file name
        with open('local_path_to_folder_containing_all_the_html_files\\' + line) as f:
            doc = f.read()
        soup = BeautifulSoup(doc, 'html.parser')
        for speech in soup.select('span.display-text'):
            final_speech = str(speech)
            print(final_speech)

Running this code, I get the following error message:

Traceback (most recent call last):
  File "extract_individual_speeches.py", line 11, in <module>
    doc = f.read()
  File "/usr/lib/python3.4/codecs.py", line 319, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x97 in position 56443: invalid start byte

I understand this is a decode error, and I have tried running this code on other HTML files, not just the first one in the list of file names in 'scan_here.txt'. I get the same error, so I think it's an encoding issue local to the HTML files.

I think the problem might lie with the third line of the HTML below, which declares the same encoding in all my HTML files:

<html>
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=windows-1251">

What is 'windows-1251'? I assume it's the problem here. I've looked it up and seen that there are some windows-1251 to UTF-8 converters, but I didn't see one that works well with Python.

I found this SO thread which seems to deal with this issue of conversion, but I'm not sure how to integrate it with my existing code.

Any help on this issue is much appreciated, TIA.

dataelephant
1 Answer

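'windows-1251' is a standard Windows encoding: a single-byte code page used mainly for Cyrillic text. If your files really are in that encoding, reading them with the UTF-8 codec (the one shown in your traceback) fails on bytes such as 0x97. Python 3 strings are Unicode, so once a file is decoded correctly you can write it back out as UTF-8 if that is what you need. You can specify the encoding when you open a file.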

Try something like this:

with open(file,'r',encoding='windows-1251') as f:
  text = f.read()

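or, if you have read the file in binary mode (so that text is a bytes object, since a Python 3 str has no decode method):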

text = text.decode('windows-1251')
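
To see why the original read fails, here is a minimal sketch that uses only the byte reported in the traceback (0x97). The variable names are illustrative and not taken from the question.

```python
# The byte from the traceback, on its own.
raw = b'\x97'

try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    # 0x97 is a continuation byte in UTF-8, so it cannot start a character
    utf8_ok = False

# In windows-1251 the same byte is a single character: U+2014 (em dash)
decoded = raw.decode('windows-1251')

print(utf8_ok)
print(repr(decoded))
```

So a file that contains this byte cannot be read as UTF-8, but it decodes without complaint as windows-1251.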

You can also use codecs:

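import codecs
# read the file as windows-1251, then write it back out as UTF-8
with codecs.open(file, 'r', 'windows-1251') as src:
    text = src.read()
with codecs.open(file, 'w', 'UTF-8') as dst:
    dst.write(text)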
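
As a rough sketch of how this could fit into the loop from the question, assuming every saved page really is windows-1251: the helper names (`read_html`, `iter_file_names`, `load_all`) are hypothetical, and the BeautifulSoup step is left out so the example runs with the standard library alone.

```python
import os


def read_html(path, encoding='windows-1251'):
    """Return the decoded text of one saved page (hypothetical helper)."""
    # Assumption: the file is in the encoding named by its META tag.
    with open(path, 'r', encoding=encoding) as f:
        return f.read()


def iter_file_names(list_path):
    """Yield each non-empty file name listed in the index file."""
    with open(list_path) as reference:
        for line in reference:
            name = line.strip()  # drops the trailing newline
            if name:
                yield name


def load_all(list_path, html_dir, encoding='windows-1251'):
    """Yield (file name, decoded HTML) pairs for every listed page."""
    for name in iter_file_names(list_path):
        yield name, read_html(os.path.join(html_dir, name), encoding)
```

Each decoded string can then be passed to BeautifulSoup(doc, 'html.parser') exactly as in the question, for example by looping over load_all('scan_here.txt', 'local_path_to_folder_containing_all_the_html_files').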
Peter S.
  • Hi Peter S., thanks for the answers. I don't even know if it's an issue with the html file's encoding, though, because when I manually save it as UTF-8 I get the same error message. – dataelephant Mar 23 '16 at 06:50
  • Remove BOM from the file and try text = text.encode(encoding='UTF-8',errors='replace') – Peter S. Mar 23 '16 at 13:07
  • Oh I see, it's a different format which details the byte sequence. I didn't save it with UTF-8 BOM, just regular UTF-8. – dataelephant Mar 23 '16 at 13:49