
In Scrapy projects, we can get persistence support (pausing and resuming crawls) by defining a job directory through the JOBDIR setting, e.g.:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

But how do I do the same when running spiders with scrapy.crawler.CrawlerProcess from a Python script, as described in How to run Scrapy from within a Python script?

Amit Basuri

1 Answer


As the question you referenced points out, you can pass settings to the CrawlerProcess instance.

So all you need to do is pass the JOBDIR setting:

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'JOBDIR': 'crawls/somespider-1'  # <----- Here
})

process.crawl(MySpider)
process.start() 
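
You can then stop the crawl at any time with a single Ctrl-C (Scrapy shuts down gracefully and saves its state to the JOBDIR), and resume it later by running the same script again.

If the script lives inside a Scrapy project and you also want your project's settings.py applied, one possible variant (just a sketch, assuming a standard project layout; crawls/somespider-1 is the same example directory as above) is to load the project settings and set JOBDIR on top of them:

import scrapy
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

# Load settings.py from the surrounding Scrapy project,
# then add the job directory for pause/resume support
settings = get_project_settings()
settings.set('JOBDIR', 'crawls/somespider-1')

process = CrawlerProcess(settings)
process.crawl(MySpider)
process.start()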
Granitosaurus
  • This isn't working for me. I am trying to do a sitemap spider, and it keeps saying "Filtered duplicate request". However, there are GBs of requests in the requests.queue file. Why is Scrapy not using those, but instead starting with the sitemap? If it filters out the only starting URL I give it, it'll never find anything! – superdee Mar 14 '19 at 15:14