Unable to extract data from BeautifulSoup object after utf-8 conversion due to 'str' typecasting

Question

I'm trying to build my own web scraper using Python. One of the steps involves parsing an HTML page, for which I am using BeautifulSoup, which is the parser recommended in most tutorials. Here is my code which should extract the page and print it:

import urllib
from bs4 import BeautifulSoup

urlToRead = "http://www.randomjoke.com/topic/haha.php"
handle = urllib.urlopen(urlToRead)
htmlGunk = handle.read()
soup = BeautifulSoup(htmlGunk, "html.parser")
soup = soup.prettify()
print (soup)

However, there seems to be an error when I do soup.prettify() and then print it. The error is:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 16052: ordinal not in range(128)

To resolve this, I googled further and came across this answer on SO, which resolved it. I basically had to set the encoding to 'utf-8', which I did. Here is the modified code (last 2 lines only):

soup = soup.prettify().encode('utf-8')
print (soup)

This works just fine. The problem arises when I try to use the soup.get_text() method, as described in a tutorial here. Whenever I call soup.get_text(), I get an error:

AttributeError: 'str' object has no attribute 'get_text'

I think this is expected, since I'm encoding the soup to 'utf-8' and that changes it to a str. I tried printing type(soup) before and after the utf-8 conversion and, as expected, before the conversion it was an instance of the bs4.BeautifulSoup class and after it was a str.
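To illustrate the type change described above, here is a minimal sketch that uses only the standard library. The sample string is made up for illustration (it is not taken from the scraped page), and the snippet is written so that it should run on both Python 2.7 and Python 3.

```python
# Illustrative sample text (an assumption, not from the page); contains U+00A9.
sample = u"Copyright \u00a9 2016"

# Encoding to ASCII fails because U+00A9 is outside the 7-bit ASCII range.
try:
    sample.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True

# Encoding to UTF-8 succeeds, but the result is a plain byte string.
encoded = sample.encode("utf-8")

print(ascii_failed)
print(type(encoded).__name__)        # 'str' on Python 2, 'bytes' on Python 3
print(hasattr(encoded, "get_text"))  # byte strings have no get_text method
```

The same thing happens to soup once it is rebound to the encoded result: the name then refers to a byte string, not to the parsed document.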

How do I work around this? I'm pretty sure I'm doing something wrong and there's a proper way around this. Unfortunately, I'm not too familiar with Python, so please bear with me.

If your only problem is non-working `print` you can do `encode` on the print lines. — wRAR, Mar 27 '16 at 07:46
Yes, why not? You don't need to replace your `soup` object just to print it. — wRAR, Mar 27 '16 at 07:49
I get the error `'unicode' object has no attribute 'get_text'` — gabbar0x, Mar 27 '16 at 07:51
just print it or use some other variable to save it other than soup — Sayed Zainul Abideen, Mar 27 '16 at 07:52
Yup, that's because `soup.prettify()` returns `unicode` and you've lost your original `bs4.BeautifulSoup` object. This error was already in your code. — wRAR, Mar 27 '16 at 07:54
Yes you are right. It works now. Can you tell me *When* (in what scenario) we should use `soup.prettify()`? It seems that when you want to extract text, we should not be using it. It seems to me that `soup.prettify()` only gives us a navigable string, is that correct? — gabbar0x, Mar 27 '16 at 07:55
"Navigable string"? It just prints the parsed tree as one document, it is not useful for anything else. — wRAR, Mar 27 '16 at 08:02
There are hundreds of similar reports, doesn't one of them address your issues? Maybe it would help you if you extracted a minimal example first, as the guidelines require? — Ulrich Eckhardt, Mar 27 '16 at 08:19

score 1 · Accepted Answer · answered Mar 27 '16 at 08:01

You should not discard your original soup object. Call soup.prettify().encode('utf-8') only at the point where you need to print it, or save the result in a different variable so that soup still refers to the BeautifulSoup object.
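As a rough sketch of that advice, the snippet below keeps soup intact and stores the encoded markup under a separate name. The inline HTML and the names pretty_bytes and text are assumptions made for illustration; the inline string stands in for the downloaded page so the example does not need network access.

```python
from bs4 import BeautifulSoup

# Hypothetical inline HTML standing in for the downloaded page.
html = u"<html><body><p>Copyright \u00a9 2016</p></body></html>"

soup = BeautifulSoup(html, "html.parser")

# Store the encoded, pretty-printed markup under a different name...
pretty_bytes = soup.prettify().encode("utf-8")

# ...so `soup` is still a BeautifulSoup object and get_text() still works.
text = soup.get_text()

print(pretty_bytes)
print(text.encode("utf-8"))
```

Because soup is never reassigned, any other BeautifulSoup method remains available after printing.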

wRAR


score 1 · Answer 2 · answered Mar 27 '16 at 08:38

import urllib
from bs4 import BeautifulSoup

urlToRead = "http://www.randomjoke.com/topic/haha.php"
handle = urllib.urlopen(urlToRead)
htmlGunk = handle.read()
soup = BeautifulSoup(htmlGunk, "html.parser")
# Keep `soup` as the BeautifulSoup object; put the encoded output in separate variables.
html_code = soup.prettify().encode('utf-8')
text = soup.get_text().encode('utf-8')

# Python 2 print statements, matching the question's environment.
print html_code
print "#################"
print text



# a = soup.find()
# l = []
# for i in a.next_elements:
#     l.append(i)
