
I want to start a crawler in Scrapy from a Python module. I want to essentially mimic `$ scrapy crawl my_crawler -a some_arg=value -L DEBUG`.

I have the following things in place:

  • a settings.py file for the project
  • items and pipelines
  • a crawler class which extends BaseSpider and requires arguments upon initialisation.

I can quite happily run my project using the scrapy command as specified above; however, I'm writing integration tests and I want to programmatically:

  • launch the crawl using the settings in settings.py and the crawler that has the `my_crawler` name attribute (I can instantiate this class easily from my test module).
  • I want all the pipelines and middleware to be used as per the specification in settings.py.
  • I'm quite happy for the process to be blocked until the crawler has finished. The pipelines dump things in a DB and it's the contents of the DB I'll be inspecting after the crawl is done to satisfy my tests.

So, can anyone help me? I've seen some examples on the net but they are either hacks for multiple spiders, or getting around Twisted's blocking nature, or don't work with Scrapy 0.14 or above. I just need something real simple. :-)

Edwardr
    what is wrong with `subprocess.check_output(['scrapy', ...], stderr=subprocess.STDOUT)`? – jfs Jun 26 '12 at 18:51
    I feel that starting another process and executing the external script is a bit of a hack. I know it's possible to do it from within the same process (obviously), and I would like to know how to do it myself. :-) – Edwardr Jun 26 '12 at 20:20
    it is not a hack if it is an integration test; otherwise you would depend on a specific version of scrapy (internals change faster than the command-line interface) – jfs Jun 26 '12 at 20:40
  • Check this post out: http://www.tryolabs.com/Blog/2011/09/27/calling-scrapy-python-script/ – John Mee Jun 27 '12 at 23:12
  • @J.F.Sebastian I'm working on OSX and if I use only two subprocesses, i.e. two Scrapy instances, the second instance consumes 90% CPU. This is not feasible. – pemistahl Sep 03 '12 at 09:52
  • @Peter: What makes you think that it is due to subprocess overhead? If you have a specific question, you could [ask](http://stackoverflow.com/questions/ask). – jfs Sep 03 '12 at 10:11
  • @J.F.Sebastian I wasn't precise enough. Even if I don't use subprocesses but start two Scrapy instances from the command line, for example, the second instance always consumes an enormous amount of CPU. This happens regardless of whether I use subprocesses or not. I assume this is the general case with Scrapy, as it was not built to run several instances of it in parallel. Maybe this is also an issue with Twisted. – pemistahl Sep 03 '12 at 10:46
  • Does this answer your question? [How to run Scrapy from within a Python script](https://stackoverflow.com/questions/13437402/how-to-run-scrapy-from-within-a-python-script) – Jakub Kukul Jun 06 '20 at 22:28
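
For reference, a minimal sketch of the subprocess approach suggested in the comments above, assuming the exact command line from the question (my_crawler and some_arg=value are the asker's own names) and that it is run from inside the Scrapy project directory:

import subprocess

# Run the crawl exactly as the shell command would; check_output blocks until
# the process exits and raises CalledProcessError on a non-zero exit status.
output = subprocess.check_output(
    ['scrapy', 'crawl', 'my_crawler', '-a', 'some_arg=value', '-L', 'DEBUG'],
    stderr=subprocess.STDOUT)

# By this point the pipelines have written to the DB, so the test can inspect it.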

2 Answers

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy.settings import Settings
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider

spider = FollowAllSpider(domain='scrapinghub.com')
crawler = Crawler(Settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start()
reactor.run() # the script will block here until the spider_closed signal is sent

See the Scrapy docs on running Scrapy from a script.

Wilfred Hughes

@wilfred's answer from the official docs works fine except for the logging part; here's mine:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from testspiders.spiders.followall import FollowAllSpider
from scrapy.utils.project import get_project_settings

spider = FollowAllSpider()
crawler = Crawler(get_project_settings())
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start_from_settings(get_project_settings())
reactor.run()
yegong
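
A sketch of how the same pattern might serve the original integration test, under the assumption that MyCrawler is the spider class whose name attribute is my_crawler and that some_arg mirrors -a some_arg=value from the command line; note that Twisted's reactor cannot be restarted, so this can only run once per test process:

from twisted.internet import reactor
from scrapy.crawler import Crawler
from scrapy import log, signals
from scrapy.utils.project import get_project_settings

from myproject.spiders.my_crawler import MyCrawler  # hypothetical import path

settings = get_project_settings()     # loads settings.py, so pipelines and middleware apply
spider = MyCrawler(some_arg='value')  # constructor arguments replace -a some_arg=value
crawler = Crawler(settings)
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(spider)
crawler.start()
log.start_from_settings(settings)
reactor.run()  # blocks until the spider_closed signal stops the reactor

# By this point the pipelines have written to the DB, so the test can inspect it.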