
In Scrapy I'm trying to write a downloader middleware that filters out responses with status 401, 403, or 410 and sends new requests for those URLs. Scrapy raises an error saying that process_response must return a Response or a Request, because I yield 10 requests to make sure the failed URLs are retried enough times. How should I fix it? Thank you.

Here is my middleware code, which I activated in settings.py:

'''

class NegativeResponsesDownloaderMiddlerware(Spider):

    def process_response(self, request, response, spider):  # tag each request with its http status
        # Called with the response returned from the downloader.
        # Must either:
        # - return a Response object
        # - return a Request object
        # - or raise IgnoreRequest
        print("---(NegativeResponsesDownloaderMiddlerware)")
        filtered_status_list = [401, 403, 410]
        adaptoz = FailedRequestsItem()
        if response.status in filtered_status_list:
            adaptoz['error_code'][response.url] = response.status
            print("---(process_response) => Sending URL back to DOWNLOADER: URL =>", response.url)

            for i in range(self.settings.get('ERROR_HANDLING_ATTACK_RATE')):
                yield Request(response.url, self.check_retrial_result, headers=self.headers)

            raise IgnoreRequest(f"URL taken out from first flow. Error Code: {adaptoz['error_code']} => URL = {response.url}")
        else:
            return response

    def check_retrial_result(self, response):
        if response.status == 200:
            x = XxxSpider()
            x.parse_event(response)
        else:
            return None

'''

avakado0

2 Answers


Unfortunately Scrapy doesn't know what to do with a middleware method's return value when you turn it into a generator, i.e. you cannot use yield in any of the middleware interface methods.

Instead, what you can do is generate the sequence of requests and feed them back into the Scrapy engine so that they are processed by your spider as if they had come from start_urls or start_requests.

You can do this by passing each of the created requests to the crawler.engine.crawl method when the response matches your filter, and raising IgnoreRequest after the loop completes.

def process_response(self, request, response, spider):
    filtered_status_list = [401, 403, 410]
    adaptoz = FailedRequestsItem()
    if response.status in filtered_status_list:
        adaptoz['error_code'][response.url] = response.status
        for i in range(spider.settings.get('ERROR_HANDLING_ATTACK_RATE')):
            retry_request = scrapy.Request(
                response.url,
                callback=callback_method,   # your retry callback
                headers=request.headers,
                dont_filter=True,           # keep the dupefilter from dropping the repeated URL
            )
            spider.crawler.engine.crawl(retry_request, spider)
        raise IgnoreRequest(f"URL taken out from first flow. Error Code: {adaptoz['error_code']} => URL = {response.url}")
    return response
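Note that a downloader middleware is not a Spider, so it has no self.settings or self.headers of its own; it normally receives the crawler through from_crawler. Below is a hedged sketch of the complete middleware under that assumption, reusing ERROR_HANDLING_ATTACK_RATE and check_retrial_result from the question (neither is built into Scrapy):

import scrapy
from scrapy.exceptions import IgnoreRequest

class NegativeResponsesDownloaderMiddlerware:

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler)

    def __init__(self, crawler):
        self.crawler = crawler
        # Custom setting from the question, not a built-in Scrapy setting.
        self.attack_rate = crawler.settings.getint("ERROR_HANDLING_ATTACK_RATE", 10)

    def process_response(self, request, response, spider):
        if response.status not in (401, 403, 410):
            return response
        for _ in range(self.attack_rate):
            retry_request = scrapy.Request(
                response.url,
                callback=spider.check_retrial_result,  # assumed to be defined on the spider
                headers=request.headers,
                dont_filter=True,  # keep the dupefilter from dropping the repeats
            )
            # Older Scrapy versions take (request, spider); newer ones take only the request.
            self.crawler.engine.crawl(retry_request, spider)
        raise IgnoreRequest(f"URL taken out from first flow: {response.url} ({response.status})")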
Alexander

If I understand correctly, what you are trying to achieve can be done using settings alone:

RETRY_TIMES = 10  # default is 2
RETRY_HTTP_CODES = [401, 403, 410]  # default: [500, 502, 503, 504, 522, 524, 408, 429]

Docs are here.
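For example, if you only need this behaviour for one spider, the same two settings can go in that spider's custom_settings. A minimal sketch, reusing XxxSpider and parse_event from the question (the start URL is a placeholder):

import scrapy

class XxxSpider(scrapy.Spider):
    name = "xxx"

    # Per-spider overrides for the built-in RetryMiddleware.
    custom_settings = {
        "RETRY_TIMES": 10,
        "RETRY_HTTP_CODES": [401, 403, 410],
    }

    def start_requests(self):
        # Placeholder URL; replace with your real start URLs.
        yield scrapy.Request("https://example.com", callback=self.parse_event)

    def parse_event(self, response):
        # Only responses that eventually succeed (or that you explicitly
        # allow via handle_httpstatus_list) reach this callback.
        self.logger.info("Got %s for %s", response.status, response.url)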

Upendra