
I am working on a Scrapy project to download images from a site that requires authentication. Everything works fine and I am able to download images. What I need is to pause and resume the spider whenever needed, so I followed the job persistence approach described in the Scrapy manual. To run the spider I used the command below:

scrapy crawl somespider -s JOBDIR=crawls/somespider-1

To pause the engine I pressed Ctrl+C once, and to resume I ran the same command again.

But after resuming, the spider closes within a few minutes; it does not pick up from where it left off.

Updated:

from scrapy.spider import Spider
from scrapy.http import Request, FormRequest


class SampleSpider(Spider):
    name = "sample project"
    allowed_domains = ["xyz.com"]
    start_urls = (
        'http://abcyz.com/',
    )

    def parse(self, response):
        # log in first, then move on to the image pages
        return FormRequest.from_response(response,
                                         formname='Loginform',
                                         formdata={'username': 'Name',
                                                   'password': '****'},
                                         callback=self.after_login)

    def after_login(self, response):
        # check that the login succeeded before going on
        if "authentication error" in str(response.body).lower():
            print "I am error"
            return
        else:
            start_urls = ['..', '..']
            for url in start_urls:
                yield Request(url=url, callback=self.parse_photos, dont_filter=True)

    def parse_photos(self, response):
        pass  # downloading image here

What am I doing wrong?

This is the log I get when I run the spider after pausing:

2014-05-13 15:40:31+0530 [scrapy] INFO: Scrapy 0.22.0 started (bot: sampleproject)
2014-05-13 15:40:31+0530 [scrapy] INFO: Optional features available: ssl, http11, boto, django
2014-05-13 15:40:31+0530 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'sampleproject.spiders', 'SPIDER_MODULES': ['sampleproject.spiders'], 'BOT_NAME': 'sampleproject'}
2014-05-13 15:40:31+0530 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-05-13 15:40:31+0530 [scrapy] INFO: Enabled downloader middlewares: RedirectMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-05-13 15:40:31+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-05-13 15:40:31+0530 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2014-05-13 15:40:31+0530 [sample] INFO: Spider opened
2014-05-13 15:40:31+0530 [sample] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-05-13 15:40:31+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-05-13 15:40:31+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080

......................

2014-05-13 15:42:06+0530 [sample] INFO: Dumping Scrapy stats:
    {'downloader/request_bytes': 141184,
     'downloader/request_count': 413,
     'downloader/request_method_count/GET': 412,
     'downloader/request_method_count/POST': 1,
     'downloader/response_bytes': 11213203,
     'downloader/response_count': 413,
     'downloader/response_status_count/200': 412,
     'downloader/response_status_count/404': 1,
     'file_count': 285,
     'file_status_count/downloaded': 285,
     'finish_reason': 'shutdown',
     'finish_time': datetime.datetime(2014, 5, 13, 10, 12, 6, 534088),
     'item_scraped_count': 125,
     'log_count/DEBUG': 826,
     'log_count/ERROR': 1,
     'log_count/INFO': 9,
     'log_count/WARNING': 219,
     'request_depth_max': 12,
     'response_received_count': 413,
     'scheduler/dequeued': 127,
     'scheduler/dequeued/disk': 127,
     'scheduler/enqueued': 403,
     'scheduler/enqueued/disk': 403,
     'start_time': datetime.datetime(2014, 5, 13, 10, 10, 31, 232618)}
2014-05-13 15:42:06+0530 [sample] INFO: Spider closed (shutdown)

After resuming, it stops and displays:

INFO: Scrapy 0.22.0 started (bot: sampleproject)
2014-05-13 15:42:32+0530 [scrapy] INFO: Optional features available: ssl, http11, boto, django
2014-05-13 15:42:32+0530 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'sampleproject.spiders', 'SPIDER_MODULES': ['sampleproject.spiders'], 'BOT_NAME': 'sampleproject'}
2014-05-13 15:42:32+0530 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-05-13 15:42:32+0530 [scrapy] INFO: Enabled downloader middlewares: RedirectMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-05-13 15:42:32+0530 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-05-13 15:42:32+0530 [scrapy] INFO: Enabled item pipelines: ImagesPipeline
2014-05-13 15:42:32+0530 [sample] INFO: Spider opened
2014-05-13 15:42:32+0530 [sample] INFO: Resuming crawl (276 requests scheduled)
2014-05-13 15:42:32+0530 [sample] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-05-13 15:42:32+0530 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-05-13 15:42:32+0530 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080


2014-05-13 15:43:19+0530 [sample] INFO: Closing spider (finished)
2014-05-13 15:43:19+0530 [sample] INFO: Dumping Scrapy stats:
    {'downloader/exception_count': 3,
     'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 3,
     'downloader/request_bytes': 132365,
     'downloader/request_count': 281,
     'downloader/request_method_count/GET': 281,
     'downloader/response_bytes': 567884,
     'downloader/response_count': 278,
     'downloader/response_status_count/200': 278,
     'file_count': 1,
     'file_status_count/downloaded': 1,
     'finish_reason': 'finished',
     'finish_time': datetime.datetime(2014, 5, 13, 10, 13, 19, 554981),
     'item_scraped_count': 276,
     'log_count/DEBUG': 561,
     'log_count/ERROR': 1,
     'log_count/INFO': 8,
     'log_count/WARNING': 1,
     'request_depth_max': 1,
     'response_received_count': 278,
     'scheduler/dequeued': 277,
     'scheduler/dequeued/disk': 277,
     'scheduler/enqueued': 1,
     'scheduler/enqueued/disk': 1,
     'start_time': datetime.datetime(2014, 5, 13, 10, 12, 32, 659276)}
2014-05-13 15:43:19+0530 [sample] INFO: Spider closed (finished)
user

2 Answers


Instead of the command you wrote, you can run this one:

scrapy crawl somespider --set JOBDIR=crawl1

To stop it, press Ctrl+C once and wait for Scrapy to shut down. If you press Ctrl+C twice, it won't work properly!

Then, to resume the crawl, run the same command again:

scrapy crawl somespider --set JOBDIR=crawl1
Maryam Homayouni
  • How do you clean up a JOBDIR after the crawl is finished? – Nirbhay Kundan Jul 12 '18 at 05:02
    "it wont work properly" sounds like a bug. scrapy should use a journaled database, and commit regularly. as workaround, create backup copies of the jobdir, and kill + restart scrapy regularly – milahu Dec 14 '21 at 19:03

Since you have to authenticate, I'm assuming the cookies expired when you resumed the job. Refer to: Scrapy Persistence Gotchas.

Figure out the HTTP status code you get when the cookies have expired or authentication fails; then you can use something like this:

def parse(self, response):
    # a non-200 status (e.g. 404) suggests the session/cookies are no longer valid;
    # note that non-200 responses reach the callback only if they are allowed,
    # e.g. via handle_httpstatus_list
    if response.status != 200:
        self.authenticate()  # placeholder for re-doing the login
        # continue with scraping
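
For illustration, here is a minimal sketch of what such an authenticate helper might look like, reusing the login form from the question; authenticate and relogin are hypothetical methods added to the spider, not Scrapy APIs, and the callback above would have to return or yield the request for Scrapy to actually schedule it:

from scrapy.http import Request, FormRequest

def authenticate(self):
    # hypothetical helper: fetch the login page again and re-submit the form
    return Request('http://abcyz.com/', callback=self.relogin, dont_filter=True)

def relogin(self, response):
    # form name and fields are taken from the question's code
    return FormRequest.from_response(response,
                                     formname='Loginform',
                                     formdata={'username': 'Name',
                                               'password': '****'},
                                     callback=self.after_login)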

Hope this is helpful.

Girish
  • I don't think so. I am checking for an authentication error after I log in, and as for cookie expiration, I tried resuming right after aborting and the same problem still occurs. I have included my code in the question; please take a look. – user May 13 '14 at 11:17
  • @user - I have faced this issue quite a bit and haven't found a solution yet. Check this: http://stackoverflow.com/questions/13724730/how-to-get-the-scrapy-failure-urls I usually get the failed URLs, dump them into a pickle file, and load them when I start the crawler again. I know this is not a solution, just a workaround. – Girish May 13 '14 at 12:23
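
A rough sketch of that pickle-based workaround (the file name and helper functions are illustrative assumptions, not the commenter's actual code):

import pickle

FAILED_URLS_FILE = 'failed_urls.pkl'  # hypothetical file name

def load_failed_urls(path=FAILED_URLS_FILE):
    # return previously saved failed URLs, or an empty list on the first run
    try:
        with open(path, 'rb') as f:
            return pickle.load(f)
    except IOError:
        return []

def save_failed_urls(urls, path=FAILED_URLS_FILE):
    # persist the failed URLs so the next run can retry them
    with open(path, 'wb') as f:
        pickle.dump(urls, f)

On startup the spider could yield requests for the URLs returned by load_failed_urls() alongside its normal start_urls, and record failures (for example from a request errback) with save_failed_urls().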