I created a small Scrapy project with this structure:

scrapyProject/
 ├── scrapy.cfg
 └── scrapyProject
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    └── spiders
        ├── crawl_products.py
        └── __init__.py

The file crawl_products.py contains the spider products_spider. To start the spider I use:

scrapy crawl products_spider

Now I want to start the spider from another Python script and wait until its execution ends.

In case it helps: the script from which I want to run the spider is a Django view.

farhawa

1 Answer


You can find half of the solution in this very good explanation in the Scrapy docs (Run Scrapy from a script).
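For reference, here is a minimal sketch of what that approach looks like, assuming the script is run from inside the project directory so that get_project_settings() can locate your scrapy.cfg:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project settings (requires scrapy.cfg to be discoverable,
# i.e. run this script from within the Scrapy project directory)
process = CrawlerProcess(get_project_settings())
process.crawl("products_spider")
process.start()  # blocks here until the crawl is finished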

BUT, and that's the more important half of the solution: never, ever run a scraper directly from a Django view (nor from any other web framework).

Please don't. I have seen this way too often, and doing so will block your web app. As a result, your view will run into a browser timeout, and at some point your app won't be able to process other requests anymore.

The clean solution here is to use a background process that runs the scraper. A good library for this is Celery, and this topic has already been discussed in detail here: Running Scrapy spiders in a Celery task
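To give you an idea, here is a minimal sketch of such a setup. The module layout, project path, and view are placeholders, and it assumes you already have a Celery app wired up to your Django project:

# tasks.py -- a hypothetical Celery task that shells out to the Scrapy CLI,
# keeping the Twisted reactor out of the web and worker processes entirely
import subprocess
from celery import shared_task

@shared_task
def run_products_spider():
    subprocess.run(
        ["scrapy", "crawl", "products_spider"],
        cwd="/path/to/scrapyProject",  # placeholder: the directory containing scrapy.cfg
        check=True,
    )

# views.py -- the Django view only enqueues the task and returns immediately
from django.http import JsonResponse
from .tasks import run_products_spider  # hypothetical import path

def start_crawl(request):
    run_products_spider.delay()  # non-blocking; the Celery worker does the crawling
    return JsonResponse({"status": "queued"})

The view returns within milliseconds; the actual crawl happens in the Celery worker's own process, so no browser timeout can occur.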

Done Data Solutions
  • Celery is overkill for Django. Use django-q – frenzy Mar 10 '19 at 17:07
  • Django-q might be more lightweight, but in this case Celery has the big advantage of running completely outside of Django. Integrating Scrapy scrapers into something that's controlled by Django can be a huge pain - at least it was on every occasion when I tried it in the past ... – Done Data Solutions Mar 10 '19 at 18:47