
I have the following spider, which requests the start_urls and, for every URL found there, has to make many sub-requests.

def parse(self, response): 
    print(response.request.headers['User-Agent'])

    for info in response.css('div.infolist'):

        item = MasterdataScraperItem()
        
        info_url = BASE_URL + info.css('a::attr(href)').get() # URL to subpage
        print('Subpage: ' + info_url)
    
        item['name'] = info.css('img::attr(alt)').get()
        
        yield scrapy.Request(info_url, callback=self.parse_info, meta={'item': item})

The for loop in the code above runs around 200 times, and after around 100 iterations I start getting HTTP status code 429 (Too Many Requests).

My idea was to set DOWNLOAD_DELAY to 3.0, but this somehow does not apply to the loop; scrapy.Request is simply called a few hundred times in a row.
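
For reference, this is roughly how the delay was set (a minimal sketch, assuming it goes in the spider's custom_settings; the spider and its name are illustrative):

import scrapy

class MasterdataSpider(scrapy.Spider):
    name = "masterdata"  # illustrative name
    # Intended to wait 3 seconds between downloads, but the sub-requests
    # yielded from parse() still go out back to back
    custom_settings = {
        "DOWNLOAD_DELAY": 3.0,
    }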

Is there a way to wait n seconds before the next scrapy.Request is issued?

    Does this answer your question? [How to give delay between each requests in scrapy?](https://stackoverflow.com/questions/8768439/how-to-give-delay-between-each-requests-in-scrapy) – Kulasangar Jan 05 '23 at 11:42
  • @Kulasangar No, I have mentioned that I have tried it with DOWNLOAD_DELAY but it's not getting applied to scrapy.Request – csphmay Jan 05 '23 at 11:49
  • check out concurrent_requests and autothrottle settings – Alexander Jan 06 '23 at 02:52
  • @Alexander concurrent_requests is set to 1 and autothrottle is enabled – csphmay Jan 06 '23 at 09:45

1 Answer


You can limit the number of requests handled by the downloader at the same time by setting CONCURRENT_REQUESTS:

import scrapy

class MySpider(scrapy.Spider):
    custom_settings = {
        "CONCURRENT_REQUESTS": 1,  # only one request in flight at a time
    }
    # Rest of code
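
If a single concurrent request is still too fast for the server, it can be combined with DOWNLOAD_DELAY and AutoThrottle (mentioned in the comments). A sketch of the equivalent project-wide settings in settings.py, assuming the rest of the project keeps its defaults:

CONCURRENT_REQUESTS = 1      # one request in flight at a time
DOWNLOAD_DELAY = 3.0         # wait 3 seconds between consecutive requests to the same site
AUTOTHROTTLE_ENABLED = True  # adapt the delay based on server response times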