I'm trying to build my own web scraper using Python. One of the steps involves parsing an HTML page, for which I am using BeautifulSoup, the parser recommended in most tutorials. Here is my code, which should fetch the page and print it:
import urllib
from bs4 import BeautifulSoup
urlToRead = "http://www.randomjoke.com/topic/haha.php"
handle = urllib.urlopen(urlToRead)
htmlGunk = handle.read()
soup = BeautifulSoup(htmlGunk, "html.parser")
soup = soup.prettify()
print (soup)
However, I get an error when I call soup.prettify() and then print the result. The error is:
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 16052: ordinal not in range(128)
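For what it's worth, I think the same kind of error can be reproduced without fetching the page at all. This is only a minimal sketch, assuming the failure comes from text containing the copyright sign (U+00A9) being encoded to ASCII on output; the sample string is made up and the syntax is Python 3 compatible:

```python
# Made-up sample text containing the copyright sign U+00A9 (an assumption;
# it stands in for the real page content, which is not reproduced here).
sample = u"Copyright \xa9 Example"

try:
    # ASCII only covers code points 0-127, so U+00A9 cannot be represented.
    sample.encode("ascii")
    error_name = None
except UnicodeEncodeError as exc:
    error_name = type(exc).__name__
    print(exc)

# UTF-8 can represent the character, so this call succeeds.
encoded = sample.encode("utf-8")
```

Here the ASCII encode fails with a UnicodeEncodeError, while the UTF-8 encode succeeds.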
To resolve this, I googled further and came across this answer on SO, which resolved it. I basically had to set the encoding to 'utf-8', which I did. So here is the modified code (last 2 lines only):
soup = soup.prettify().encode('utf-8')
print (soup)
This works just fine. The problem arises when I try to use the soup.get_text() method, as mentioned in a tutorial here. Whenever I call soup.get_text(), I get an error:
AttributeError: 'str' object has no attribute 'get_text'
I think this is expected, since I'm encoding the soup to 'utf-8' and that changes it to a str. I tried printing type(soup) before and after the utf-8 conversion and, as expected, before conversion it was an object of the bs4.BeautifulSoup class and after, it was str.
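The type change I'm describing can be sketched as follows. This is a reduced version under some assumptions: the HTML string is made up and stands in for the downloaded page, and it is written for Python 3, where the encoded result is bytes rather than str:

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for the downloaded page (an assumption).
html = u"<html><body><p>Copyright \xa9 Example</p></body></html>"

soup = BeautifulSoup(html, "html.parser")
type_before = type(soup)      # the parsed document object
print(type_before)

# Rebinding the same name replaces the parsed document with encoded text.
soup = soup.prettify().encode("utf-8")
type_after = type(soup)       # bytes on Python 3 (str on Python 2)
print(type_after)

# The encoded text has no parser methods, so get_text() is gone.
has_get_text = hasattr(soup, "get_text")
print(has_get_text)
```

After the rebinding, the name soup no longer refers to a BeautifulSoup object, which matches the AttributeError above.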
How do I work around this? I'm pretty sure I'm doing something wrong and there's a proper way to handle this. Unfortunately, I'm not too familiar with Python, so please bear with me.