I want to start a crawler in Scrapy from a Python module. Essentially, I want to mimic $ scrapy crawl my_crawler -a some_arg=value -L DEBUG
I have the following things in place:
- a settings.py file for the project
- items and pipelines
- a crawler class which extends BaseSpider and requires arguments upon initialisation.
I can quite happily run my project using the scrapy command as specified above. However, I'm writing integration tests, and I want to do the following programmatically (a rough sketch of what I mean follows the list below):
- launch the crawl using the settings in settings.py and the crawler that has the my_crawler name attribute (I can instantiate this class easily from my test module)
- have all the pipelines and middleware used as per the specification in settings.py
- I'm quite happy for the process to block until the crawler has finished. The pipelines dump things into a DB, and it's the contents of the DB I'll be inspecting after the crawl is done to satisfy my tests.
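For concreteness, this is roughly the shape of call I'm hoping exists. I'm assuming something like the CrawlerProcess / get_project_settings approach here; those names are my assumption and I haven't verified they exist in 0.14, so treat this as a sketch of the intent rather than working code:

    # Rough sketch, assuming a reasonably recent Scrapy API
    # (CrawlerProcess and get_project_settings); these may not
    # exist in this form in 0.14.
    from scrapy.crawler import CrawlerProcess
    from scrapy.utils.project import get_project_settings

    settings = get_project_settings()    # picks up settings.py, pipelines, middleware
    settings.set('LOG_LEVEL', 'DEBUG')   # the equivalent of -L DEBUG

    process = CrawlerProcess(settings)
    process.crawl('my_crawler', some_arg='value')  # same as -a some_arg=value
    process.start()                      # blocks until the crawl has finished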
So, can anyone help me? I've seen some examples on the net, but they are either hacks for multiple spiders, or getting around Twisted's blocking nature, or don't work with Scrapy 0.14 or above. I just need something real simple. :-)