I'm trying to build my own web scraper using Python. One of the steps involves parsing an HTML page, for which I am using BeautifulSoup, which is the parser recommended in most tutorials. Here is my code which should extract the page and print it:

import urllib
from bs4 import BeautifulSoup

urlToRead = "http://www.randomjoke.com/topic/haha.php"
handle = urllib.urlopen(urlToRead)
htmlGunk =  handle.read()
soup = BeautifulSoup(htmlGunk, "html.parser")
soup = soup.prettify()
print (soup)

However, there seems to be an error when I do soup.prettify() and then print it. The error is:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa9' in position 16052: ordinal not in range(128)

To resolve this, I searched further and came across this answer on SO, which fixed it. I basically had to encode the output as 'utf-8', which I did. Here is the modified code (last 2 lines only):

soup = soup.prettify().encode('utf-8')
print (soup)

This works just fine. The problem arises when I try to use the soup.get_text() method as mentioned on a tutorial here. Whenever I do soup.get_text(), I get an error:

AttributeError: 'str' object has no attribute 'get_text'

I think this is expected, since encoding the soup to 'utf-8' changes it to a str. I tried printing type(soup) before and after the utf-8 conversion and, as expected, before the conversion it was an object of the bs4.BeautifulSoup class and afterwards it was a str.

How do I work around this? I'm pretty sure I'm doing something wrong and there's a proper way to do this. Unfortunately, I'm not too familiar with Python, so please bear with me.

gabbar0x
  • If your only problem is non-working `print` you can do `encode` on the print lines. – wRAR Mar 27 '16 at 07:46
  • so basically `print(soup.get_text().encode('utf-8'))`? – gabbar0x Mar 27 '16 at 07:47
  • Yes, why not? You don't need to replace your `soup` object just to print it. – wRAR Mar 27 '16 at 07:49
  • I get the error `'unicode' object has no attribute 'get_text'` – gabbar0x Mar 27 '16 at 07:51
  • just print it or use some other variable to save it other than soup – Sayed Zainul Abideen Mar 27 '16 at 07:52
  • Yup, that's because `soup.prettify()` returns `unicode` and you've lost your original `bs4.BeautifulSoup` object. This error was already in your code. – wRAR Mar 27 '16 at 07:54
  • Yes you are right. It works now. Can you tell me *When* (in what scenario) we should use `soup.prettify()`? It seems that when you want to extract text, we should not be using it. It seems to me that `soup.prettify()` only gives us a navigable string, is that correct? – gabbar0x Mar 27 '16 at 07:55
  • "Navigable string"? It just prints the parsed tree as one document, it is not useful for anything else. – wRAR Mar 27 '16 at 08:02
  • There are hundreds of similar reports, doesn't one of them address your issues? Maybe it would help you if you extracted a minimal example first, as the guidelines require? – Ulrich Eckhardt Mar 27 '16 at 08:19

2 Answers

You should not discard your original soup object. You can call soup.prettify().encode('utf-8') when you need to print it (or save it into a different variable).
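
One way to read this advice is the minimal sketch below. It assumes Python 3 (where prettify() returns a str; on Python 2, as in the question, it returns unicode) and uses a made-up inline HTML string in place of the downloaded page, so the names html, pretty, text, pretty_bytes and text_bytes are illustrative only. It shows that the string returned by prettify() has no get_text(), while the untouched soup object still does.

```python
from bs4 import BeautifulSoup

# Made-up sample markup standing in for the downloaded page (assumption).
html = "<html><body><p>Copyright \u00a9 2016</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

# prettify() returns a plain text string, so keep it under its own name
# instead of rebinding `soup`.
pretty = soup.prettify()
has_get_text = hasattr(pretty, "get_text")  # strings have no get_text()

# `soup` is still the parsed tree, so get_text() keeps working.
text = soup.get_text()

# Encode only at the point of output, into separate variables.
pretty_bytes = pretty.encode("utf-8")
text_bytes = text.encode("utf-8")

print(type(soup).__name__, type(pretty).__name__, has_get_text)
print(text_bytes)
```

Keeping the encoded bytes in their own variables means soup can still be searched or passed to get_text() later in the scraper.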

wRAR

import urllib
from bs4 import BeautifulSoup

urlToRead = "http://www.randomjoke.com/topic/haha.php"
handle = urllib.urlopen(urlToRead)
htmlGunk =  handle.read()
soup = BeautifulSoup(htmlGunk, "html.parser")
# Keep `soup` as the BeautifulSoup object and store the encoded output in separate variables.
html_code = soup.prettify().encode('utf-8')
text = soup.get_text().encode('utf-8')

print html_code
print "#################"
print text



# a = soup.find()
# l = []
# for i in a.next_elements:
#     l.append(i)