I want to scrape an API. After 10 consecutive 404 pages, the spider should stop, since it has most likely reached the end of my list. At the same time, the spider should tolerate occasional 404 pages in between, which come from deleted events/pks.
Currently, my counter always starts at 0 for each parsed URL, which is not what I want.
import scrapy
from scrapy.exceptions import CloseSpider


class EventSpider(scrapy.Spider):
    handle_httpstatus_list = [404]  # TODO: Move to middleware?
    name = "eventpage"
    start_urls = [
        'https://www.eventwebsite.com/api-internal/v1/events/%s/?format=json' % page
        for page in range(1, 12000)
    ]

    def parse(self, response):
        # Accept X 404 errors until we stop processing
        count_404 = 0  # <-- this is reset on every call to parse()
        print("################", response.status, "################")
        if response.status == 404:
            count_404 += 1
            print("404 Counter: ", count_404)
            print("################################")
            if count_404 == 10:
                raise CloseSpider('10 consecutive 404 responses')  # Stop scraping
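What I think I need is a counter that lives on the spider instance rather than inside parse(), so it survives across requests: increment it on each 404, reset it on any successful response, and raise CloseSpider once it hits 10. Below is a rough, untested sketch of that idea (the names consecutive_404 and max_consecutive_404 are just mine, and the yield line is a placeholder for the real item extraction). Is this the right approach, or is there a more idiomatic Scrapy way, e.g. via a middleware as in my TODO?

import scrapy
from scrapy.exceptions import CloseSpider


class EventSpider(scrapy.Spider):
    handle_httpstatus_list = [404]
    name = "eventpage"
    start_urls = [
        'https://www.eventwebsite.com/api-internal/v1/events/%s/?format=json' % page
        for page in range(1, 12000)
    ]

    # Counter stored on the spider, so it persists between requests
    consecutive_404 = 0
    max_consecutive_404 = 10  # stop after this many 404s in a row

    def parse(self, response):
        if response.status == 404:
            self.consecutive_404 += 1
            if self.consecutive_404 >= self.max_consecutive_404:
                # Most likely past the end of the event list -> shut down
                raise CloseSpider('%d consecutive 404 responses' % self.consecutive_404)
            return  # isolated 404 (deleted event/pk), nothing to extract
        # Any non-404 response breaks the streak
        self.consecutive_404 = 0
        yield response.json()  # requires Scrapy >= 2.2; stand-in for real extraction

One thing I'm unsure about: since Scrapy sends requests concurrently, responses may arrive out of order, so "10 in a row" would only be approximate unless I also limit concurrency.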