
I followed the advice from these two posts as I am also trying to create a generic scrapy spider:

How to pass a user defined argument in scrapy spider

Creating a generic scrapy spider

But I'm getting an error that the variable I am supposed to be passing as an argument is not defined. Am I missing something in my init method?

Code:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector

from data.items import DataItem

class companySpider(BaseSpider):
    name = "woz"

    def __init__(self, domains=""):
        '''
        domains is a string
        '''
        self.domains = domains

    deny_domains = [""]
    start_urls = [domains]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('/html')
        items = []
        for site in sites:
            item = DataItem()
            item['text'] = site.select('text()').extract()
            items.append(item)
        return items

Here is my command line:

scrapy crawl woz -a domains="http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"

And here is the error:

NameError: name 'domains' is not defined
jstaker7
  • I forgot to reference the variable in start_urls as self.domains, but now the error says that self is not defined. I have an answer to my own question but have to wait 4 hours before I can post. To be continued... – jstaker7 Jul 19 '13 at 20:37

1 Answer

You should call super(companySpider, self).__init__(*args, **kwargs) at the beginning of your __init__:

def __init__(self, domains="", *args, **kwargs):
    super(companySpider, self).__init__(*args, **kwargs)
    self.domains = domains
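
As for the NameError itself: a class body runs once, when the class is defined, so a class-level line like start_urls = [domains] is evaluated before any __init__ call and cannot see its parameters. Here is a minimal sketch of that behaviour in plain Python, without Scrapy. The names Broken, Fixed and build_broken are hypothetical and exist only for illustration:

```python
def build_broken():
    # Defining the class executes its body immediately. `domains` is only
    # a parameter of __init__, so the class-level lookup below fails.
    class Broken(object):
        def __init__(self, domains=""):
            self.domains = domains

        start_urls = [domains]  # NameError is raised here, at definition time

    return Broken


try:
    build_broken()
    error_message = None
except NameError as exc:
    error_message = str(exc)


class Fixed(object):
    def __init__(self, domains=""):
        # Per-instance values have to be assigned inside a method, via self.
        self.domains = domains
        self.start_urls = [domains]


spider = Fixed(domains="http://example.com/")
```

Writing self.domains at class level fails for the same reason: self only exists inside a method call.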

In your case, where the first requests depend on a spider argument, what I usually do is override only the start_requests() method, without overriding __init__(). Each argument passed with -a on the command line is already available as an attribute of the spider:

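from scrapy.http import Request
from scrapy.spider import BaseSpider

class companySpider(BaseSpider):
    name = "woz"
    deny_domains = [""]

    def start_requests(self):
        # self.domains comes from the -a domains=... command-line argument
        yield Request(self.domains)  # for example, if domains is a single URL

    def parse(self, response):
        ...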
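
To show why self.domains exists without a custom __init__, here is a rough sketch of the mechanism in plain Python. It assumes the base spider copies its keyword arguments onto the instance, which is my understanding of how -a arguments arrive. SketchSpider and CompanySketch are hypothetical stand-ins, not Scrapy classes, and a plain string stands in for the Request object:

```python
class SketchSpider(object):
    """Simplified stand-in for a Scrapy base spider (not the real class)."""

    def __init__(self, name=None, **kwargs):
        if name is not None:
            self.name = name
        # Each keyword argument becomes an instance attribute.
        self.__dict__.update(kwargs)


class CompanySketch(SketchSpider):
    name = "woz"

    def start_requests(self):
        # Fall back to "" when no domains argument was supplied.
        url = getattr(self, "domains", "")
        # The real spider would yield Request(url) here.
        yield url


spider = CompanySketch(
    domains="http://www.dmoz.org/Computers/Programming/Languages/Python/Books/"
)
first = next(spider.start_requests())
```

Using getattr with a default avoids an AttributeError when the spider is started without -a domains=...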
paul trmbrth