
I am quite new to Scrapy (and my background is not in informatics). There is a website I can't visit with my local IP, since I am banned from it, but I can visit it in a browser through a VPN service. So that my spider can crawl it, I set up a pool of proxies that I found at http://proxylist.hidemyass.com/. With that my spider is able to crawl and scrape items, but my doubt is: do I have to change the proxy pool list every day? Sorry if my question is a dumb one...

Here is my settings.py:

BOT_NAME = 'reviews'

SPIDER_MODULES = ['reviews.spiders']
NEWSPIDER_MODULE = 'reviews.spiders'
DOWNLOAD_DELAY = 1
RANDOMIZE_DOWNLOAD_DELAY = True

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpcompression.HttpCompressionMiddleware': None,  # avoids "exceptions.IOError: Not a gzipped file" errors
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
    'reviews.rotate_useragent.RotateUserAgentMiddleware': 400,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'reviews.middlewares.ProxyMiddleware': 100,
}

PROXIES = [{'ip_port': '168.63.249.35:80', 'user_pass': ''},
           {'ip_port': '162.17.98.242:8888', 'user_pass': ''},
           {'ip_port': '70.168.108.216:80', 'user_pass': ''},
           {'ip_port': '45.64.136.154:8080', 'user_pass': ''},
           {'ip_port': '149.5.36.153:8080', 'user_pass': ''},
           {'ip_port': '185.12.7.74:8080', 'user_pass': ''},
           {'ip_port': '150.129.130.180:8080', 'user_pass': ''},
           {'ip_port': '185.22.9.145:8080', 'user_pass': ''},
           {'ip_port': '200.20.168.135:80', 'user_pass': ''},
           {'ip_port': '177.55.64.38:8080', 'user_pass': ''},]

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'reviews (+http://www.yourdomain.com)'

Here is my middlewares.py:

import base64
import random
from reviews.settings import PROXIES

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # Pick a random proxy from the pool for every request.
        proxy = random.choice(PROXIES)
        request.meta['proxy'] = "http://%s" % proxy['ip_port']
        # Only send credentials when they are actually set; the original
        # "is not None" check was always true for the empty strings in PROXIES.
        if proxy['user_pass']:
            encoded_user_pass = base64.encodestring(proxy['user_pass']).strip()
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass

Another question: if the website is HTTPS, should I have a proxy pool for HTTPS only, and then another class, HTTPSProxyMiddleware(object), that receives an HTTPS_PROXIES list?

Here is my rotate_useragent.py:

import random
from scrapy.contrib.downloadermiddleware.useragent import UserAgentMiddleware

class RotateUserAgentMiddleware(UserAgentMiddleware):
    def __init__(self, user_agent=''):
        self.user_agent = user_agent

    def process_request(self, request, spider):
        ua = random.choice(self.user_agent_list)
        if ua:
            request.headers.setdefault('User-Agent', ua)

    # The default user_agent_list contains Chrome, IE, Firefox, Mozilla, Opera and Netscape user agents.
    # For more user agent strings, see http://www.useragentstring.com/pages/useragentstring.php
    user_agent_list = [
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1",
        "Mozilla/5.0 (X11; CrOS i686 2268.111.0) AppleWebKit/536.11 (KHTML, like Gecko) Chrome/20.0.1132.57 Safari/536.11",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1092.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.6 (KHTML, like Gecko) Chrome/20.0.1090.0 Safari/536.6",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/19.77.34.5 Safari/537.1",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.9 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.0) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.36 Safari/536.5",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 5.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_0) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1063.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1062.0 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.1 Safari/536.3",
        "Mozilla/5.0 (Windows NT 6.2) AppleWebKit/536.3 (KHTML, like Gecko) Chrome/19.0.1061.0 Safari/536.3",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
        "Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/535.24 (KHTML, like Gecko) Chrome/19.0.1055.1 Safari/535.24",
    ]

One last question (sorry if it is again a stupid one): in settings.py there is a commented default part, # Crawl responsibly by identifying yourself (and your website) on the user-agent followed by #USER_AGENT = 'reviews (+http://www.yourdomain.com)'. Should I uncomment it and put my personal information, or just leave it as it is? I want to crawl efficiently while following good policies and habits to avoid possible ban issues...

I am asking all this because with these changes my spiders started to throw errors like

twisted.internet.error.TimeoutError: User timeout caused connection failure: Getting http://www.example.com/browse/?start=884 took longer than 180.0 seconds.

and

Error downloading <GET http://www.example.com/article/2883892/x-review.html>: [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>]

and

Error downloading <GET http://www.example.com/browse/?start=6747>: TCP connection timed out: 110: Connection timed out.

Thanks so much for your help and time.

Inês Martins

2 Answers


There is already a library that does this: https://github.com/aivarsk/scrapy-proxies

Please download it from there. It is not on PyPI yet, so you can't install it easily using pip or easy_install.
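
For reference, its configuration goes in settings.py and looks roughly like the sketch below. I am paraphrasing the project's README from memory, so verify the exact setting names there; the middleware paths use the old scrapy.contrib layout to match your settings.py, and the proxy-list path is just a placeholder.

# Sketch of a scrapy-proxies setup in settings.py (verify names against the README).

# Retry many times, since free proxies fail often.
RETRY_TIMES = 10
# Retry on most error codes, because proxies fail for different reasons.
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
}

# Plain-text file with one proxy per line, e.g. http://host:port
PROXY_LIST = '/path/to/proxy/list.txt'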

Aminah Nuraini
  1. There's no single correct answer to this. Some proxies are not always available, so you have to check them now and then. Also, if you use the same proxy every time, the server you are scraping may block its IP as well, but that depends on the security mechanisms that server has.
  2. Yes, because you don't know whether all the proxies in your pool support HTTPS. Or you could keep just one pool and add a field to each proxy that indicates its HTTPS support (there is a sketch of that after this list).
  3. In your settings you are disabling the user agent middleware ('scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None), so the USER_AGENT setting won't have any effect.
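
A rough sketch of option 2 (a single pool with a per-proxy HTTPS flag) could look like the middleware below. The 'https' field, the class name and the import path are only illustrative assumptions, not anything built into Scrapy; adapt them to your project.

import base64
import random

# Assumes entries like {'ip_port': '1.2.3.4:8080', 'user_pass': '', 'https': True}
from reviews.settings import PROXIES


class SchemeAwareProxyMiddleware(object):
    """Pick a random proxy that supports the scheme of the outgoing request."""

    def process_request(self, request, spider):
        if request.url.startswith('https://'):
            candidates = [p for p in PROXIES if p.get('https')]
        else:
            candidates = PROXIES
        if not candidates:
            return  # no suitable proxy in the pool; send the request directly
        proxy = random.choice(candidates)
        request.meta['proxy'] = "http://%s" % proxy['ip_port']
        if proxy['user_pass']:
            encoded_user_pass = base64.encodestring(proxy['user_pass']).strip()
            request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
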
lufte