I am new to Python and Scrapy. I used the method from the blog post Running multiple scrapy spiders programmatically to run my spiders in a Flask app. Here is the code:

from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.settings import Settings

# list of crawlers (the spider classes are imported from my project)
TO_CRAWL = [DmozSpider, EPGDspider, GDSpider]

# crawlers that are running 
RUNNING_CRAWLERS = []

def spider_closing(spider):
    """
    Activates on spider closed signal
    """
    log.msg("Spider closed: %s" % spider, level=log.INFO)
    RUNNING_CRAWLERS.remove(spider)
    if not RUNNING_CRAWLERS:
        reactor.stop()

# start logger
log.start(loglevel=log.DEBUG)

# set up the crawler and start to crawl one spider at a time
for spider in TO_CRAWL:
    settings = Settings()

    # crawl responsibly
    settings.set("USER_AGENT", "Kiran Koduru (+http://kirankoduru.github.io)")
    crawler = Crawler(settings)
    crawler_obj = spider()
    RUNNING_CRAWLERS.append(crawler_obj)

    # stop reactor when spider closes
    crawler.signals.connect(spider_closing, signal=signals.spider_closed)
    crawler.configure()
    crawler.crawl(crawler_obj)
    crawler.start()

# blocks process; so always keep as the last statement
reactor.run()

Here is my spider's code:

import scrapy
from scrapy.selector import Selector
from scrapy.http import Request

# EPGD is the Item class defined in my project's items module

class EPGDspider(scrapy.Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery="+term+"&submit=Feeling+Lucky"]
    MONGODB_DB = name + "_" + term
    MONGODB_COLLECTION = name + "_" + term

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        url_list = []
        base_url = "http://epgd.biosino.org/EPGD"

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['genID_url'] = base_url+map(unicode.strip, site.xpath('td[1]/a/@href').extract())[0][2:]
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['taxID_url'] = map(unicode.strip, site.xpath('td[2]/a/@href').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['familyID_url'] = base_url+map(unicode.strip, site.xpath('td[3]/a/@href').extract())[0][2:]
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            yield item

        sel_tmp = Selector(response)
        link = sel_tmp.xpath('//span[@id="quickPage"]')

        for site in link:
            url_list.append(site.xpath('a/@href').extract())

        # pagination: request the link that comes right after the "#" entry
        for i in range(len(url_list[0])):
            if cmp(url_list[0][i], "#") == 0:
                if i+1 < len(url_list[0]):
                    print url_list[0][i+1]
                    actual_url = "http://epgd.biosino.org/EPGD/search/"+ url_list[0][i+1]
                    yield Request(actual_url, callback=self.parse)
                    break
                else:
                    print "The index is out of range!"

As you can see, there is a parameter term = 'man' in my code, and it is part of my start URL. I don't want this parameter to be fixed, so how can I pass the start URL or the term parameter dynamically in my program? When running a spider from the command line, there is a way to pass a parameter, like below:

class MySpider(BaseSpider):

    name = 'my_spider'

    def __init__(self, *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [kwargs.get('start_url')]

And start it like: `scrapy crawl my_spider -a start_url="http://some_url"`
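
Presumably the same pattern applied to my spider would look something like this (my own untested sketch; the term value would ultimately come from the Flask request), but I don't see how to pass term to the spider when it is started from the script above:

class EPGDspider(scrapy.Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]

    def __init__(self, term=None, *args, **kwargs):
        super(EPGDspider, self).__init__(*args, **kwargs)
        self.term = term
        # build the start URL from the dynamically supplied term
        self.start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?"
                           "textquery=" + term + "&submit=Feeling+Lucky"]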

Can anybody tell me how to deal with this?

Coding_Rabbit

1 Answer

First of all, to run multiple spiders in a script, the recommended way is to use scrapy.crawler.CrawlerProcess, where you pass spider classes and not spider instances.

To pass arguments to your spider with CrawlerProcess, you just have to add the arguments to the .crawl() call, after the spider subclass, e.g.

    process.crawl(DmozSpider, term='someterm', someotherterm='anotherterm')

Arguments passed this way are then available as spider attributes (same as with -a term=someterm on the command line).
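
A minimal end-to-end sketch of that (assuming the three spider classes from the question are importable and that the project settings are what you want) could look like this:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# pass the spider *classes*, plus any keyword arguments the spiders expect
process.crawl(DmozSpider)
process.crawl(EPGDspider, term='man')
process.crawl(GDSpider)

process.start()  # blocks here until all crawls are finished

CrawlerProcess starts and stops the Twisted reactor for you, so the manual reactor.run()/reactor.stop() handling from the question is not needed.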

Finally, instead of building start_urls in __init__, you can achieve the same with start_requests, and you can build initial requests like this, using self.term:

def start_requests(self):
    yield Request("http://epgd.biosino.org/"
                  "EPGD/search/textsearch.jsp?"
                  "textquery={}"
                  "&submit=Feeling+Lucky".format(self.term))
paul trmbrth
  • First, thank you for your detailed answer! I've tried `CrawlerProcess`, but there is a problem: I can't use it in a Flask app. When I do, I get an error saying that signals only work in the main thread, and I have asked about this in [this question](http://stackoverflow.com/questions/36384286/cant-call-scrapy-spiders-in-flask-web-framework), but there are no effective solutions yet. So do you have another method? – Coding_Rabbit Apr 18 '16 at 14:05
  • If you want to use `scrapy.crawler.Crawler`, [it needs to be instantiated with `(spidercls, settings)`](http://doc.scrapy.org/en/latest/topics/api.html#scrapy.crawler.Crawler), not only settings. e.g. `crawler = Crawler(DmozSpider, settings)` and then `crawler.crawl(term="someterm")` – paul trmbrth Apr 19 '16 at 13:51
  • The problem is that I run these spiders in a Flask app, so should I try `scrapy.crawler.Crawler` instead of `CrawlerProcess`? – Coding_Rabbit Apr 19 '16 at 14:08
  • I don't know how to run a Scrapy spider in a Flask app. I'll ask around. – paul trmbrth Apr 19 '16 at 14:44
  • I found that I'm using `scrapy 0.24.0` instead of `scrapy 1.0`, and in `scrapy 0.24.0` the Crawler takes only one parameter, `settings`; it's a little different from the latest version. – Coding_Rabbit Apr 20 '16 at 01:41
  • Ok, then check 0.24 docs, there's [an example with a spider being passed arguments](http://doc.scrapy.org/en/0.24/topics/practices.html#run-scrapy-from-a-script): – paul trmbrth Apr 20 '16 at 07:41
  • Sorry, I have read that official document, but I don't see an example of passing arguments there... I have also tried the `scrapy 1.0` approach with `CrawlerRunner`; does `CrawlerRunner` have a way to pass arguments? – Coding_Rabbit Apr 20 '16 at 08:02
  • Thanks @paultrmbrth for the feedback. I will update my blog post as you suggested. –  Jun 24 '17 at 10:44