I have a list of URLs and I want to crawl each of them. Please note:

- Adding this array as `start_urls` is not the behavior I'm looking for. I would like the URLs to be crawled one by one, in separate crawl sessions.
- I want to run Scrapy multiple times in the same process.
- I want to run Scrapy as a script, as covered in Common Practices, and not from the CLI.

The following code is a full, broken, copy-pastable example. It loops through a list of URLs and starts a crawl for each of them, based on the Common Practices documentation.
```python
from urllib.parse import urlparse

from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging
from scrapy.spiders import CrawlSpider


class MySpider(CrawlSpider):
    name = 'my-spider'

    def __init__(self, start_url, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        # Crawl only the given URL and stay on its domain
        self.start_urls = [start_url]
        self.allowed_domains = [urlparse(start_url).netloc]


urls = [
    'http://testphp.vulnweb.com/',
    'http://testasp.vulnweb.com/'
]

configure_logging({'LOG_FORMAT': '%(levelname)s: %(message)s'})
runner = CrawlerRunner()

# Schedule a crawl for each URL, then start the reactor
for url in urls:
    runner.crawl(MySpider, url)

reactor.run()
```
The problem with the above is that it hangs after the first URL; the second URL is never crawled and nothing happens after this:
```
2018-08-13 20:28:44 [scrapy.core.engine] DEBUG: Crawled (200) <GET http://testphp.vulnweb.com/> (referer: None)
[...]
2018-08-13 20:28:44 [scrapy.core.engine] INFO: Spider closed (finished)
```
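For reference, the sequential-crawling example in the Common Practices documentation chains the crawls with `defer.inlineCallbacks`, which I assume is the pattern I need to adapt. Roughly (my paraphrase of the docs; `MySpider1` and `MySpider2` are placeholder spider classes):

```python
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    # Each yield resumes only after the previous crawl has finished
    yield runner.crawl(MySpider1)  # MySpider1/MySpider2 are placeholders, as in the docs
    yield runner.crawl(MySpider2)
    reactor.stop()  # stop the reactor once the last crawl is done

crawl()
reactor.run()  # the script blocks here until crawl() stops the reactor
```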