I want to be able to run the Scrapy web crawling framework from within Django. Scrapy itself only provides a command line tool scrapy
to execute its commands, i.e. the tool was not intentionally written to be called from an external program.
The user Mikhail Korobov came up with a nice solution, namely to call Scrapy from a Django custom management command. For convenience, I repeat his solution here:
# -*- coding: utf-8 -*-
# myapp/management/commands/scrapy.py
from __future__ import absolute_import
from django.core.management.base import BaseCommand
class Command(BaseCommand):
def run_from_argv(self, argv):
self._argv = argv
return super(Command, self).run_from_argv(argv)
def handle(self, *args, **options):
from scrapy.cmdline import execute
execute(self._argv[1:])
Instead of calling e.g. scrapy crawl domain.com
I can now do python manage.py scrapy crawl domain.com
from within a Django project. However, the options of a Scrapy command are not parsed at all. If I do python manage.py scrapy crawl domain.com -o scraped_data.json -t json
, I only get the following response:
Usage: manage.py scrapy [options]
manage.py: error: no such option: -o
So my question is, how to extend the custom management command to adopt Scrapy's command line options?
Unfortunately, Django's documentation of this part is not very extensive. I've also read the documentation of Python's optparse module but afterwards it was not clearer to me. Can anyone help me in this respect? Thanks a lot in advance!