
I have a list of URLs. I want to crawl each of these. Please note

  • adding the whole list as start_urls is not the behavior I'm looking for. I would like the URLs to be crawled one by one, in separate crawl sessions.
  • I want to run Scrapy multiple times in the same process
  • I want to run Scrapy as a script, as covered in Common Practices, and not from the CLI.

The following code is a full, broken, copy-pastable example. It loops through a list of URLs and starts the crawler on each of them, based on the Common Practices documentation.

from urllib.parse import urlparse
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = 'my-spider'

    def __init__(self, start_url,  *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [start_url]
        self.allowed_domains = [urlparse(start_url).netloc]


urls = [
    'http://testphp.vulnweb.com/',
    'http://testasp.vulnweb.com/'
]

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

for url in urls:
    runner.crawl(MySpider, url)
    reactor.run()

The problem with the above is that it hangs after the first URL; the second URL is never crawled and nothing happens after this:

2018-08-13 20:28:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://testphp.vulnweb.com/> (referer: None)
[...]
2018-08-13 20:28:44 [scrapy.core.engine] INFO: Spider closed (finished)

1 Answer


The reactor.run() call blocks forever as soon as it is reached on the first iteration, so your loop never advances to the second URL. The only way around this is to play by Twisted's rules: start the reactor once and drive the crawls asynchronously. One way to do that is to replace your for loop with a Twisted-style asynchronous loop, like so:

from twisted.internet.defer import inlineCallbacks

# ... everything from the question up to and including `runner = CrawlerRunner()` ...

@inlineCallbacks
def loop_urls(urls):
    for url in urls:
        yield runner.crawl(MySpider, url)
    reactor.stop()

loop_urls(urls)
reactor.run()

and this magic roughly translates to the following callback-based version:

def loop_urls(urls):
    url, *rest = urls
    dfd = runner.crawl(MySpider, url)
    # crawl() returns a Deferred to which a callback (or errback) can be attached;
    # it fires when this crawl finishes, starting the next one (or stopping the reactor)
    dfd.addCallback(lambda _: loop_urls(rest) if rest else reactor.stop())

loop_urls(urls)
reactor.run()

You could use that version as well, but it's far from pretty.
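For completeness, a minimal end-to-end sketch assembling the spider and URL list from the question with the inlineCallbacks loop (assuming a stock Scrapy/Twisted install) could look like this:

from urllib.parse import urlparse
from twisted.internet import reactor
from twisted.internet.defer import inlineCallbacks
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = 'my-spider'

    def __init__(self, start_url, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.start_urls = [start_url]
        self.allowed_domains = [urlparse(start_url).netloc]


urls = [
    'http://testphp.vulnweb.com/',
    'http://testasp.vulnweb.com/',
]

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()


@inlineCallbacks
def loop_urls(urls):
    # each yield resumes only after the current crawl has finished
    for url in urls:
        yield runner.crawl(MySpider, url)
    reactor.stop()


loop_urls(urls)
reactor.run()  # started once; stopped by loop_urls() after the last crawl

The key difference from the question's version is that reactor.run() is called exactly once, and each crawl is chained off the previous one inside the reactor's event loop.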
