
I am a newbie trying to write a web-spider script. I want to go to a page, enter data in a textbox, go to the next page by clicking the submit button, and retrieve all the data on the new page, iteratively.

The following is the code I am trying:

import urllib
import urllib2
import string
import sys
from BeautifulSoup import BeautifulSoup

hdr = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11',
       'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
       'Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3',
       'Accept-Encoding': 'none',
       'Accept-Language': 'en-US,en;q=0.8',
       'Connection': 'keep-alive'}
values = {'query' : '5ed10c844ed4266a18d34e2ba06b381a' }
data = urllib.urlencode(values)
request = urllib2.Request("https://www.virustotal.com/#search", data, headers=hdr)
response = urllib2.urlopen(request)
the_page = response.read()
pool = BeautifulSoup(the_page)

print pool

The following is the error:

Traceback (most recent call last):
  File "C:\Users\Dipanshu\Desktop\webscraping_demo.py", line 19, in <module>
    response = urllib2.urlopen(request)
  File "C:\Python27\lib\urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "C:\Python27\lib\urllib2.py", line 406, in open
    response = meth(req, response)
  File "C:\Python27\lib\urllib2.py", line 519, in http_response
    'http', request, response, code, msg, hdrs)
  File "C:\Python27\lib\urllib2.py", line 444, in error
    return self._call_chain(*args)
  File "C:\Python27\lib\urllib2.py", line 378, in _call_chain
    result = func(*args)
  File "C:\Python27\lib\urllib2.py", line 527, in http_error_default
    raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
HTTPError: HTTP Error 403: Forbidden

How can I solve this?

Dipanshu

2 Answers

from bs4 import BeautifulSoup
import urllib.request

user_agent = 'Mozilla/5.0'
headers = {'User-Agent': user_agent }
target_url = 'https://www.google.co.kr/search?q=cat&source=lnms&tbm=isch&sa=X&ved=0ahUKEwjtrZCg7uXbAhVaUd4KHc2HDgIQ_AUICygC&biw=1375&bih=842'

request = urllib.request.Request( url=target_url, headers=headers )
req = urllib.request.urlopen(request)
soup = BeautifulSoup(req.read(), 'html.parser')

target_url: the Google search results page for "cat" images.

Setting a User-Agent in headers is what gets you past the Forbidden error; this code then opens the page and parses the response with BeautifulSoup.

Dane Lee

From what I understand, your request parameters are not set up properly, and they (maybe) send your spider to a page you aren't allowed to view.

This user had a similar problem, but fixed it by modifying the headers.
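
As a rough sketch of what "modifying the headers" means with urllib2 (the URL is the one from the question, and the header values are examples rather than a known-working set):

import urllib2

request = urllib2.Request("https://www.virustotal.com/#search")
# Headers can also be attached after the Request object is created.
request.add_header('User-Agent', 'Mozilla/5.0')
request.add_header('Accept', 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8')

response = urllib2.urlopen(request)
print response.getcode()   # 200 once the server stops rejecting the request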

NlightNFotis
  • I added all the headers specified in that post already and it still didn't work! – Dipanshu Dec 21 '12 at 09:53
  • @Dipanshu I don't think you need to add the exact headers from that post, since its author is opening a different site. You have to customise your existing `request` and its parameters. – NlightNFotis Dec 21 '12 at 09:54