
I'm trying to run Scrapy as a function on IBM Cloud. My __main__.py is as follows:

import scrapy
from scrapy import crawler
from scrapy.crawler import CrawlerProcess
from twisted.internet import reactor


class AutoscoutListSpider(scrapy.Spider):
    name = "vehicles list"

    def __init__(self, params, *args, **kwargs):
        super(AutoscoutListSpider, self).__init__(*args, **kwargs)
        make = params.get("make", None)
        model = params.get("model", None)
        mileage = params.get("mileage", None)

        init_url = "https://www.autoscout24.be/nl/resultaten?sort=standard&desc=0&ustate=N%2CU&size=20&page=1&cy=B&mmvmd0={0}&mmvmk0={1}&kmto={2}&atype=C&".format(
            model, make, mileage)
        self.start_urls = [init_url]

    def parse(self, response):
        # Get total result on list load
        init_total_results = int(response.css('.cl-filters-summary-counter::text').extract_first().replace('.', ''))
        if init_total_results > 400:
            yield {"message": "There are MORE then 400 results"}
        else:
            yield {"message": "There are LESS then 400 results"}


def main(params):
    process = CrawlerProcess()
    try:
        runner = crawler.CrawlerRunner()
        runner.crawl(AutoscoutListSpider, params)
        d = runner.join()
        d.addBoth(lambda _: reactor.stop())
        reactor.run()
        return {"Success ": main_result}
    except Exception as e:
        return {"Error ": e, "params ": params}

I upload it as an IBM Cloud function, and that works fine.

The problem is that when I run it, either from the Python console or by invoking the IBM Cloud function, the first execution works, but when I try to execute it a second time I get an error:

{'Error ': ReactorNotRestartable(), 'params ': {'make': '9', 'model': '1624', 'mileage': '2500'}}
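As far as I understand, the error comes from Twisted rather than Scrapy itself: once a reactor has been run and stopped, it cannot be started again in the same Python process. A minimal reproduction, using only Twisted and no Scrapy, would be something like:

from twisted.internet import reactor

reactor.callWhenRunning(reactor.stop)
reactor.run()    # first run starts the reactor and then stops it

reactor.run()    # raises twisted.internet.error.ReactorNotRestartable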

It is invoked like this:

IBM:

ibmcloud wsk action invoke --result ascrawler --param make 9 --param model 1624 --param mileage 2500

Python console:

main({"make":"9", "model":"1624", "mileage":"2500"})

With the following code I tried to make it possible to run it multiple times, but without success:

runner = crawler.CrawlerRunner()
runner.crawl(AutoscoutListSpider, params)
d = runner.join()
d.addBoth(lambda _: reactor.stop())
reactor.run()

Any idea how to solve it?

Boky
  • This is more like lambda? If so, it may not mean fresh restart everytime. You should try the request after sometime and see if it works then? – Tarun Lalwani Jul 05 '18 at 07:13
  • How do you invoke it? How long does it run? – data_henrik Jul 05 '18 at 07:18
  • @data_henrik From `python console` I invoke it as follows: `main({"make":"9", "model":"1624", "mileage":"2500"})`. And as `IBM action` : `ibmcloud wsk action invoke --result ascrawler --param make 9 --param model 1624 --param mileage 2500` . But nothing wrong there otherwise it wouldn't run first time – Boky Jul 05 '18 at 07:58

1 Answer


Did you mean to use the CrawlerRunner rather than CrawlerProcess?

According to the documentation, CrawlerRunner should be used instead of CrawlerProcess "if your application is already using Twisted and you want to run Scrapy in the same reactor." This is not the case for Python actions on IBM Cloud Functions.

After changing the main function to the following code, it works correctly:

def main(params):
    process = CrawlerProcess()
    try:
        process.crawl(AutoscoutListSpider, params)
        process.start()  # starts the reactor and blocks until the crawl finishes
        return {"Success ": params}
    except Exception as e:
        return {"Error ": e, "params ": params}
James Thomas
  • This gives a ReactorNotRestartable error. See related question: https://stackoverflow.com/questions/61083449/reactornotrestartable-with-scrapy-when-using-google-cloud-functions – WJA Apr 09 '20 at 16:01