I want to scrape an API. After 10 consecutive 404 pages, the spider should stop, since it has most likely reached the end of my list. At the same time, the spider should tolerate occasional 404 pages in between, which come from deleted events/pks.
Currently, my counter always starts at 0 for each parsed URL, which is not what I want.
import scrapy
from scrapy.exceptions import CloseSpider


class EventSpider(scrapy.Spider):
    handle_httpstatus_list = [404]  # TODO: Move to middleware?
    name = "eventpage"
    start_urls = [
        'https://www.eventwebsite.com/api-internal/v1/events/%s/?format=json' % page
        for page in range(1, 12000)
    ]

    def parse(self, response):
        # Accept X 404 errors until we stop processing
        count_404 = 0  # <-- this is reset on every call to parse()
        print("################", response.status, "################")
        if response.status == 404:
            count_404 += 1
            print("404 Counter: ", count_404)
            print("################################")
            if count_404 == 10:
                raise CloseSpider('10 consecutive 404 responses')  # Stop scraping
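What I think I need is a counter that lives on the spider instance rather than inside parse(), so it survives across requests: increment it on each 404, reset it on any successful response, and raise CloseSpider once it hits 10. Below is a rough, untested sketch of that idea (the names consecutive_404 and max_consecutive_404 are just mine, and the yield line is a placeholder for the real item extraction). Is this the right approach, or is there a more idiomatic Scrapy way, e.g. via a middleware as in my TODO?

import scrapy
from scrapy.exceptions import CloseSpider


class EventSpider(scrapy.Spider):
    handle_httpstatus_list = [404]
    name = "eventpage"
    start_urls = [
        'https://www.eventwebsite.com/api-internal/v1/events/%s/?format=json' % page
        for page in range(1, 12000)
    ]

    # Counter stored on the spider, so it persists between requests
    consecutive_404 = 0
    max_consecutive_404 = 10  # stop after this many 404s in a row

    def parse(self, response):
        if response.status == 404:
            self.consecutive_404 += 1
            if self.consecutive_404 >= self.max_consecutive_404:
                # Most likely past the end of the event list -> shut down
                raise CloseSpider('%d consecutive 404 responses' % self.consecutive_404)
            return  # isolated 404 (deleted event/pk), nothing to extract
        # Any non-404 response breaks the streak
        self.consecutive_404 = 0
        yield response.json()  # requires Scrapy >= 2.2; stand-in for real extraction

One thing I'm unsure about: since Scrapy sends requests concurrently, responses may arrive out of order, so "10 in a row" would only be approximate unless I also limit concurrency.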