
I am using Scrapy 1.7.3 with Crawlera (C100 plan from Scrapinghub) and Python 3.6.

When running the spider with Crawlera enabled I get about 20 to 40 items per minute. Without Crawlera I get 750 to 1000 (but I get banned quickly, of course).

Have I configured something wrong? With crawlera I should be getting at least 150 - 300 items per minute, no? Autothrottle is disabled.

Below are my spider and the relevant part of my settings.py.

import scrapy
from ecom.items import EcomItem

class AmazonSpider(scrapy.Spider):
    name = "amazon_products"
    start_urls = ["https://www.amazon.fr/gp/browse.html?node=3055095031&rh=p_76:1&page=2"]    

    def parse(self, response):
        product_urls = response.xpath("//a[@class='a-link-normal s-access-detail-page s-color-twister-title-link a-text-normal']/@href").extract()

        for product_url in product_urls:
            yield response.follow(product_url, self.parse_product)


    def parse_product(self, response):
        item = EcomItem()
        item["url"] = response.url
        yield item

settings.py

CRAWLERA_PRESERVE_DELAY = 0
CONCURRENT_REQUESTS = 80
CONCURRENT_REQUESTS_PER_DOMAIN = 80
DOWNLOAD_TIMEOUT = 20
LOG_LEVEL = 'ERROR'
RANDOMIZE_DOWNLOAD_DELAY = True
DOWNLOAD_DELAY = 0
AUTOTHROTTLE_DEBUG = False
AUTOTHROTTLE_MAX_DELAY = 4
AUTOTHROTTLE_START_DELAY = 0
AUTOTHROTTLE_ENABLED = False
COOKIES_ENABLED = False
dangee1705
Wramana

1 Answer


To achieve higher crawl rates when using Crawlera with Scrapy, disable the AutoThrottle extension and increase the maximum number of concurrent requests (the limit depends on your plan). You may also want to increase the download timeout. The following settings do that:

CONCURRENT_REQUESTS = 100
CONCURRENT_REQUESTS_PER_DOMAIN = 100
AUTOTHROTTLE_ENABLED = False
DOWNLOAD_TIMEOUT = 30
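
As a rough sanity check on what such settings can deliver, steady-state throughput is bounded by concurrency divided by the average time a request takes (Little's law). The sketch below only illustrates that arithmetic; `items_per_minute` and the latency figures are assumptions for illustration, not measurements from Crawlera.

```python
def items_per_minute(concurrent_requests: int, avg_seconds_per_request: float) -> float:
    """Upper bound on completed requests per minute (Little's law).

    Assumes every concurrency slot is always busy and every request
    yields exactly one item, which is optimistic for a real crawl.
    """
    if avg_seconds_per_request <= 0:
        raise ValueError("avg_seconds_per_request must be positive")
    return concurrent_requests * 60.0 / avg_seconds_per_request


# Hypothetical latencies: 100 slots with requests averaging 30 s each,
# and 80 slots with requests averaging 20 s each.
print(items_per_minute(100, 30.0))
print(items_per_minute(80, 20.0))
```

If the measured rate is far below this bound, the concurrency slots are probably not being filled, or many requests are failing or timing out; with `LOG_LEVEL = 'ERROR'` such retries would not be visible in the log.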

The Crawlera documentation lists more request headers that you can use to tune its behavior, e.g.:

  • X-Crawlera-Max-Retries (default 1): you can set it to 0, but you might see more bans.
  • X-Crawlera-Timeout (default 30000, in milliseconds): you can set a smaller value if you expect the website to respond faster.

Changing these can give you more results per minute, at the risk of more bans or request timeouts.
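
One way to send these headers with every request is Scrapy's `DEFAULT_REQUEST_HEADERS` setting. The following is a minimal sketch for settings.py: `crawlera_headers` is a hypothetical helper (not part of Scrapy or Crawlera), its defaults are the ones quoted above, and it does not check whether Crawlera accepts a given value.

```python
def crawlera_headers(max_retries: int = 1, timeout_ms: int = 30000) -> dict:
    """Build the two Crawlera tuning headers as strings.

    Hypothetical helper; the defaults mirror the values quoted in this answer.
    """
    return {
        "X-Crawlera-Max-Retries": str(max_retries),
        "X-Crawlera-Timeout": str(timeout_ms),
    }


# Assigning DEFAULT_REQUEST_HEADERS replaces Scrapy's built-in default
# headers, so the standard Accept headers are repeated here.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
    # Trade retries for speed: fewer retries, possibly more bans.
    **crawlera_headers(max_retries=0),
}
```

The same dictionary can also be passed per request, for example `response.follow(product_url, self.parse_product, headers=crawlera_headers(max_retries=0))`, if only some requests should skip retries.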

Ami Hollander