
I am using the same code to pick up text from web pages, but most of the time it shows "WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER." Surprisingly, it sometimes works: for example, if I run the code 12 times, it succeeds once.

Same code, same web address. Why is this happening?

from bs4 import BeautifulSoup
import re
import urllib2


url = "http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

web_p = soup.find_all('span',class_='url')

for web in web_p:
    print web 

Traceback details are as below:

Traceback (most recent call last):
  File "C:\Python27\lib\idlelib\run.py", line 112, in main
    seq, request = rpc.request_queue.get(block=True, timeout=0.05)
  File "C:\Python27\lib\Queue.py", line 176, in get
    raise Empty
Empty
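
For reference, a quick way to check what the server actually returned on any given run (a diagnostic sketch only, reusing the same urllib2 call as above) is to print the response headers before parsing:

import urllib2

url = "http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age"
page = urllib2.urlopen(url)

# Inspect what the server sent back on this particular run:
# the declared character set and any compression applied to the body.
print page.info().get('Content-Type')      # e.g. text/html; charset=UTF-8
print page.info().get('Content-Encoding')  # e.g. gzip, or None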
  • Post the traceback that appears when the error is raised. – tsroten Feb 25 '14 at 03:49
  • Possible duplicate of [Beautiful Soup, gets warning and then error halfway through code](http://stackoverflow.com/questions/17688063/beautiful-soup-gets-warning-and-then-error-halfway-through-code) – isedev Feb 25 '14 at 03:52

1 Answer


Thanks to isedev for the guidance and to the answers in "Does python urllib2 automatically uncompress gzip data fetched from webpage?". The code changed as below works:

from StringIO import StringIO
import gzip
from bs4 import BeautifulSoup
import urllib2


# Ask for gzip explicitly so the response encoding is predictable.
request = urllib2.Request('http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age')
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)

if response.info().get('Content-Encoding') == 'gzip':
    # Decompress the gzipped body before handing it to BeautifulSoup.
    buf = StringIO(response.read())
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()
else:
    # Plain, uncompressed response.
    data = response.read()

soup = BeautifulSoup(data)

web_p = soup.find_all('span', class_='url')

for web in web_p:
    print web


Thanks to Blender's guidance, the code can be simplified:

from bs4 import BeautifulSoup
import requests

# requests transparently decompresses gzip/deflate responses,
# so .text is already decoded HTML.
html = requests.get('http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age').text
soup = BeautifulSoup(html)
web_p = soup.find_all('span', class_='url')
for web in web_p:
    print web
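
As a side note, here is a minimal sketch (assuming the html.parser that ships with Python is acceptable): naming the parser explicitly stops BeautifulSoup from guessing, and requests exposes the encoding and headers it used, which makes it easy to confirm what the server sent on a given run:

from bs4 import BeautifulSoup
import requests

response = requests.get('http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age')

# requests reports how it decoded the body and what the server sent.
print response.encoding                         # e.g. UTF-8
print response.headers.get('Content-Encoding')  # e.g. gzip

# Name the parser explicitly so BeautifulSoup does not have to pick one.
soup = BeautifulSoup(response.text, 'html.parser')
for web in soup.find_all('span', class_='url'):
    print web

Since requests handles the decompression itself, the intermittent REPLACEMENT CHARACTER warning should not come back with this approach.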