
I am using the same code to pick up text from web pages, but most of the time it shows "WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER." Surprisingly, it sometimes works: for example, if I run the code 12 times, it succeeds once.

Same code, same web address. Why is this happening?

from bs4 import BeautifulSoup
import re
import urllib2


url = "http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age"
page = urllib2.urlopen(url)
soup = BeautifulSoup(page.read())

web_p = soup.find_all('span',class_='url')

for web in web_p:
    print web 

Traceback details are as below:

Traceback (most recent call last):
  File "C:\Python27\lib\idlelib\run.py", line 112, in main
    seq, request = rpc.request_queue.get(block=True, timeout=0.05)
  File "C:\Python27\lib\Queue.py", line 176, in get
    raise Empty
Empty
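
For reference, a quick way to check what the server actually returned on any given run (a diagnostic sketch only, reusing the same urllib2 call as above) is to print the response headers before parsing:

import urllib2

url = "http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age"
page = urllib2.urlopen(url)

# Inspect what the server sent back on this particular run:
# the declared character set and any compression applied to the body.
print page.info().get('Content-Type')      # e.g. text/html; charset=UTF-8
print page.info().get('Content-Encoding')  # e.g. gzip, or None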
  • Post the traceback that appears when the error is raised. – tsroten Feb 25 '14 at 03:49
  • Possible duplicate of [Beautiful Soup, gets warning and then error halfway through code](http://stackoverflow.com/questions/17688063/beautiful-soup-gets-warning-and-then-error-halfway-through-code) – isedev Feb 25 '14 at 03:52

1 Answer


Thanks to isedev for the guidance and to the answers in "Does python urllib2 automatically uncompress gzip data fetched from webpage?". The code changed as below works:

from StringIO import StringIO
import gzip
from bs4 import BeautifulSoup
import urllib2


# Ask for gzip explicitly so the response encoding is predictable.
request = urllib2.Request('http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age')
request.add_header('Accept-encoding', 'gzip')
response = urllib2.urlopen(request)

if response.info().get('Content-Encoding') == 'gzip':
    # Decompress the gzipped body before handing it to BeautifulSoup.
    buf = StringIO(response.read())
    f = gzip.GzipFile(fileobj=buf)
    data = f.read()
else:
    # Plain, uncompressed response.
    data = response.read()

soup = BeautifulSoup(data)

web_p = soup.find_all('span', class_='url')

for web in web_p:
    print web


Thanks to Blender's guidance, the code can be simplified:

from bs4 import BeautifulSoup
import requests

# requests transparently decompresses gzip/deflate responses,
# so .text is already decoded HTML.
html = requests.get('http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age').text
soup = BeautifulSoup(html)
web_p = soup.find_all('span', class_='url')
for web in web_p:
    print web
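
As a side note, here is a minimal sketch (assuming the html.parser that ships with Python is acceptable): naming the parser explicitly stops BeautifulSoup from guessing, and requests exposes the encoding and headers it used, which makes it easy to confirm what the server sent on a given run:

from bs4 import BeautifulSoup
import requests

response = requests.get('http://nz.sports.search.yahoo.com/search?p=basketball&fr=sports-nz-ss&age=1w&focuslim=age')

# requests reports how it decoded the body and what the server sent.
print response.encoding                         # e.g. UTF-8
print response.headers.get('Content-Encoding')  # e.g. gzip

# Name the parser explicitly so BeautifulSoup does not have to pick one.
soup = BeautifulSoup(response.text, 'html.parser')
for web in soup.find_all('span', class_='url'):
    print web

Since requests handles the decompression itself, the intermittent REPLACEMENT CHARACTER warning should not come back with this approach.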