
In Scrapy projects, we can get persistence support (pausing and resuming crawls) by defining a job directory through the JOBDIR setting, e.g.:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

But how do I do the same when running spiders with scrapy.crawler.CrawlerProcess from a Python script, as described in How to run Scrapy from within a Python script?

Amit Basuri

1 Answer


As the question you referenced points out, you can pass settings to the CrawlerProcess instance.

So all you need to do is pass the JOBDIR setting:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'JOBDIR': 'crawls/somespider-1'  # <----- Here
})

process.crawl(MySpider)
process.start() 
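
You can then stop the crawl at any time with a single Ctrl-C (Scrapy shuts down gracefully and saves its state to the JOBDIR), and resume it later by running the same script again.

If the script lives inside a Scrapy project and you also want your project's settings.py applied, one possible variant (just a sketch, assuming a standard project layout; crawls/somespider-1 is the same example directory as above) is to load the project settings and set JOBDIR on top of them:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

# Load settings.py from the surrounding Scrapy project,
# then add the job directory for pause/resume support
settings = get_project_settings()
settings.set('JOBDIR', 'crawls/somespider-1')

process = CrawlerProcess(settings)
process.crawl(MySpider)
process.start()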
Granitosaurus
  • This isn't working for me. I am trying to do a sitemap spider, and it keeps saying "Filtered duplicate request". However, there are GBs of requests in the requests.queue file. Why is Scrapy not using those, but instead starting with the sitemap? If it filters out the only starting URL I give it, it'll never find anything! – superdee Mar 14 '19 at 15:14