
I am trying to pass a user-defined argument to a Scrapy spider. Can anyone suggest how to do that?

I read about a -a parameter somewhere but have no idea how to use it.

bryant1410
L Lawliet

5 Answers


Spider arguments are passed in the crawl command using the -a option. For example:

scrapy crawl myspider -a category=electronics -a domain=system

Spiders can access arguments as attributes:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'

    def __init__(self, category='', **kwargs):
        self.start_urls = [f'http://www.example.com/{category}']  # py36
        super().__init__(**kwargs)  # python3

    def parse(self, response):
        self.log(self.domain)  # system

Taken from the Scrapy doc: http://doc.scrapy.org/en/latest/topics/spiders.html#spider-arguments

Update 2013: Add second argument

Update 2015: Adjust wording

Update 2016: Use newer base class and add super, thanks @Birla

Update 2017: Use Python3 super

# previously
super(MySpider, self).__init__(**kwargs)  # python2

Update 2018: As @eLRuLL points out, spiders can access arguments as attributes

eLRuLL
Steven Almeroth
  • Hey thanks for the answer. But let's say I want to pass two arguments, would I use something like: scrapy crawl myspider -a category=electronics domain=system or scrapy crawl myspider -a category=electronics -a domain=system – L Lawliet Mar 26 '13 at 18:32
  • scrapy crawl myspider -a category=electronics -a domain=system – Steven Almeroth Mar 26 '13 at 18:55
  • The above code is only partially working for me. For example, if I define domain using `self.domain`, I'm still not able to access it outside the `__init__` method. Python throws a not defined error. BTW, why have you omitted the `super` call? PS. I'm working with the CrawlSpider class – Birla Sep 24 '14 at 10:57
  • Is it possible for me to call the same spider multiple times concurrently since I have got an argument? @Siddharth – CodeGuru Apr 03 '15 at 23:21
  • @FlyingAtom Please correct me if I misunderstood, but each of these concurrent calls would be different instances of the spider, wouldn't it? – L Lawliet Apr 03 '15 at 23:30
  • @Birla, use `self.domain = domain` in the constructor to populate the class-scope variable. – Hassan Raza Sep 08 '15 at 11:11
  • This seems both overkill and not robust. The __init__ class is optional. The `print getattr(self,'category', '')` from Hassan Raza is simpler and more robust and flexible. – nealmcb Nov 23 '19 at 04:26
  • @nealmcb `__init__` is a _method_ of the spider class. Its implementation does not itself make the spider any less robust and it is included in the answer to show that you can declare defaults for keyword arguments, but as you said it's optional. As we pointed out last year, you don't need to use `getattr`; you can just access arguments as attributes, e.g. `self.category` or, as we see in the answer, `self.domain` – Steven Almeroth Nov 24 '19 at 04:45
  • Thanks! And oops - sorry about my silly "init class" typo.... I'm still unclear on what would happen if `domain` was not defined on the command line. Is there a default of None for all attributes? I suggest documenting that in the answer. It is nice to be able to incorporate the `category` in the `start_urls`. But otherwise, so far it seems preferable to me to use `getattr` with a default, which seems clearer and more concise. – nealmcb Nov 25 '19 at 15:16
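
Regarding the question in the last comment: Scrapy's base Spider.__init__ simply copies the -a keyword arguments onto the spider instance, so an argument that was never passed on the command line is not set at all (there is no implicit None default) and accessing it raises AttributeError. A minimal sketch, with hypothetical spider and argument names, of the two access styles:

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # set only if the spider was started with -a category=...;
        # otherwise this line raises AttributeError
        self.log(self.category)

        # safe either way: falls back to the given default
        self.log(getattr(self, 'domain', 'no domain given'))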

The previous answers are correct, but you don't have to declare the constructor (__init__) every time you write a Scrapy spider; you can just specify the parameters as before:

scrapy crawl myspider -a parameter1=value1 -a parameter2=value2

and in your spider code you can just use them as spider attributes:

from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'
    ...
    def parse(self, response):
        ...
        if self.parameter1 == 'value1':
            ...  # this is True

        # or also
        if getattr(self, 'parameter2') == 'value2':
            ...  # this is also True

And it just works.

eLRuLL

To pass arguments with the crawl command:

scrapy crawl myspider -a category='mycategory' -a domain='example.com'

To pass arguments when running on Scrapyd, replace -a with -d:

curl http://your.ip.address.here:port/schedule.json -d spider=myspider -d category='mycategory' -d domain='example.com'

The spider will receive arguments in its constructor.


from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'

    def __init__(self, category='', domain='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.category = category
        self.domain = domain

Scrapy sets all the arguments as spider attributes, so you can skip the init method completely. Beware: use the getattr method for getting those attributes so your code does not break if an argument was not supplied.


from scrapy import Spider

class MySpider(Spider):
    name = 'myspider'
    start_urls = ('https://httpbin.org/ip',)

    def parse(self, response):
        print(getattr(self, 'category', ''))
        print(getattr(self, 'domain', ''))

Hassan Raza

Spider arguments are passed while running the crawl command using the -a option. For example, if I want to pass a domain name as an argument to my spider, I will do this:

scrapy crawl myspider -a domain="http://www.example.com"

And receive the arguments in the spider's constructor:

from scrapy.spider import BaseSpider  # in newer Scrapy versions, use scrapy.Spider instead

class MySpider(BaseSpider):
    name = 'myspider'

    def __init__(self, domain='', *args, **kwargs):
        super(MySpider, self).__init__(*args, **kwargs)
        self.start_urls = [domain]

...

it will work :)

madmanick
Siyaram Malav

Alternatively we can use ScrapyD, which exposes an API where we can pass the start_url and spider name. ScrapyD also has APIs to start/stop/list spiders and check their status.
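
For example, a few of Scrapyd's JSON endpoints (assuming the default daemon address localhost:6800 and a project named default; job IDs come from the schedule.json/listjobs.json responses):

curl http://localhost:6800/daemonstatus.json
curl http://localhost:6800/listspiders.json?project=default
curl http://localhost:6800/listjobs.json?project=default
curl http://localhost:6800/cancel.json -d project=default -d job=<job_id>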

pip install scrapyd scrapyd-client   # scrapyd-client provides the scrapyd-deploy command
scrapyd
scrapyd-deploy local -p default

scrapyd-deploy will deploy the spider in the form of an egg into the daemon, and it even maintains versions of the spider. While starting the spider you can mention which version of the spider to use.

from scrapy.spiders import CrawlSpider

class MySpider(CrawlSpider):
    name = 'testspider'

    def __init__(self, start_urls, *args, **kwargs):
        self.start_urls = start_urls.split('|')
        super().__init__(*args, **kwargs)

curl http://localhost:6800/schedule.json -d project=default -d spider=testspider -d start_urls="https://www.anyurl...|https://www.anyurl2"
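
If several versions of the project have been deployed, schedule.json also accepts an optional _version parameter to pick one; the version string and URL below are placeholders:

curl http://localhost:6800/schedule.json -d project=default -d spider=testspider -d _version=1.0 -d start_urls="https://example.com"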

An added advantage is that you can build your own UI to accept the URL and other parameters from the user and schedule a task using the Scrapyd schedule API above.

Refer to the Scrapyd API documentation for more details.

Nagendran