
I have this piece of Python code using Scrapy:

import scrapy
import scrapy.crawler as crawler
from multiprocessing import Process, Queue
from twisted.internet import reactor

# your spider
class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            print(quote.css('span.text::text').extract_first())


# wrapper that lets the spider run more than once: the Twisted reactor
# cannot be restarted in-process, so each crawl runs in a fresh process
def run_spider():
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            deferred = runner.crawl(QuotesSpider)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            q.put(None)  # signal success to the parent process
        except Exception as e:
            q.put(e)  # send the exception back to the parent process

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if result is not None:
        raise result


print('first run:')
run_spider()

print('\nsecond run:')
run_spider()

Right now the second `run_spider()` always executes, even if `QuotesSpider` returned nothing or raised an error.

How can I make `run_spider()` not execute/queue when `QuotesSpider` errors or comes back blank?
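
For example, roughly what I have in mind is something like this (an untested sketch; I'm assuming `parse` yields items so the crawl's `item_scraped_count` stat reflects whether anything was scraped, and that the count can be sent back through the queue):

import scrapy
import scrapy.crawler as crawler
from multiprocessing import Process, Queue
from twisted.internet import reactor

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ['http://quotes.toscrape.com/tag/humor/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            # yield items instead of printing so Scrapy counts them
            yield {'text': quote.css('span.text::text').extract_first()}


def run_spider():
    def f(q):
        try:
            runner = crawler.CrawlerRunner()
            # create the Crawler explicitly so its stats can be read afterwards
            c = runner.create_crawler(QuotesSpider)
            deferred = runner.crawl(c)
            deferred.addBoth(lambda _: reactor.stop())
            reactor.run()
            # report the number of scraped items back to the parent process
            q.put(c.stats.get_value('item_scraped_count', 0))
        except Exception as e:
            q.put(e)

    q = Queue()
    p = Process(target=f, args=(q,))
    p.start()
    result = q.get()
    p.join()

    if isinstance(result, Exception):
        raise result
    return result


print('first run:')
n_items = run_spider()

if n_items:  # only queue the second run if the first scraped something
    print('\nsecond run:')
    run_spider()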

Thanks

Ardhi
  • Is threading necessary? The basic logic of scraping and then executing a function after a crawl completes is already in place in your example code; adding threading doesn't really help here. – notorious.no Jul 31 '18 at 03:49
  • I also just found that; I only need to rework the code to use Twisted alone. Then my real problem becomes sending a callback with it to acknowledge, so the second `run_spider()` won't execute. – Ardhi Jul 31 '18 at 17:26
  • Take a look at an answer I gave on [this question](https://stackoverflow.com/questions/47552507/how-to-schedule-scrapy-crawl-execution-programmatically/47583233#47583233). It might give you an idea of how to schedule crawls. – notorious.no Aug 01 '18 at 01:50
  • Thank you for your comment, but I think my case is about getting the callback/errback result from the previous `run_spider()`, so that the next one will or won't run based on that callback/errback, while still keeping the reactor running to receive new input. I'm still stuck on that. – Ardhi Aug 04 '18 at 06:10

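For reference, a minimal Twisted-only sketch along the lines the comments suggest (reusing `QuotesSpider` from above with `parse` yielding items; gating on `item_scraped_count` is my assumption, and this version stops the reactor when the chain finishes instead of keeping it alive for new input):

from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from twisted.internet import reactor

configure_logging()
runner = CrawlerRunner()

def crawl_once():
    # returns a Deferred that fires with the number of items scraped
    c = runner.create_crawler(QuotesSpider)
    d = runner.crawl(c)
    d.addCallback(lambda _: c.stats.get_value('item_scraped_count', 0))
    return d

def maybe_run_second(n_items):
    if n_items:
        print('second run:')
        # returning the Deferred makes the chain wait for this crawl too
        return crawl_once()
    print('first run scraped nothing, skipping second run')

print('first run:')
d = crawl_once()
d.addCallback(maybe_run_second)
d.addErrback(lambda failure: print('crawl failed:', failure))
d.addBoth(lambda _: reactor.stop())
reactor.run()
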
0 Answers