I'm working with scrapy. I have a spider that starts with:

class For_Spider(Spider):

    name = "for"
    table = 'hello' # creating dummy attribute. will be overwritten

    def start_requests(self):

        self.table = self.dc # dc is passed in from the command line with -a dc=...

I have the following pipeline:

class DynamicSQLlitePipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Here, you get whatever value was passed through the "table" parameter
        table = getattr(crawler.spider, "table")
        return cls(table)

    def __init__(self,table):
        try:
            db_path = "sqlite:///"+settings.SETTINGS_PATH+"\\data.db"
            db = dataset.connect(db_path)
            table_name = table[0:3]  # FIRST 3 LETTERS
            self.my_table = db[table_name]

When I start the spider with:

scrapy crawl for -a dc=input_string -a records=1

After stepping through the execution repeatedly, and with help from questions like What is the relationship between the crawler object with spider and pipeline objects?, it appears that the order of execution is:

1) For_Spider
2) DynamicSQLlitePipeline
3) start_requests

The spider's "table" parameter is passed to the DynamicSQLlitePipeline object by the from_crawler method, which has access to the different components of the scrapy system. At that point table is still "hello", the dummy value that I set. After 1 and 2 above, execution returns to the spider and start_requests begins. The command line parameters only become available inside start_requests, so it's too late to set the table name dynamically, as the pipeline has already been instantiated.
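To make the ordering concrete, here is a stripped-down, plain-Python model of what I am describing (FakeSpider and FakePipeline are made-up names for illustration, not my real code):

class FakeSpider:
    table = 'hello'                          # class-level dummy attribute

    def start_requests(self):
        self.table = self.dc                 # only runs after the pipeline exists

class FakePipeline:
    def __init__(self, table):
        print('pipeline built with table =', table)        # -> 'hello'

spider = FakeSpider()
spider.dc = 'input_string'                   # roughly what -a dc=input_string does
FakePipeline(getattr(spider, 'table'))       # pipeline is created first...
spider.start_requests()                      # ...then start_requests overwrites table
print('spider.table afterwards =', spider.table)            # -> 'input_string', too late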

Therefore I don't know if there is a way to set the pipeline table name dynamically. How can I do this?

Edit:

eLRuLL is correct, and his solution works. I looked through the spider object in step 1 and did not find any parameters listed on it. Am I missing them?

>>> Spider.__dict__
mappingproxy({'__module__': 'scrapy.spiders', '__doc__': 'Base class for scrapy spiders. All spiders must inherit from this\n    class.\n    ', 'name': None, 'custom_settings': None, '__init__': <function Spider.__init__ at 0x00000000047A6D90>, 'logger': <property object at 0x0000000003E0E598>, 'log': <function Spider.log at 0x00000000047A6EA0>, 'from_crawler': <classmethod object at 0x0000000003B28278>, 'set_crawler': <function Spider.set_crawler at 0x00000000047C9048>, '_set_crawler': <function Spider._set_crawler at 0x00000000047C90D0>, 'start_requests': <function Spider.start_requests at 0x00000000047C9158>, 'make_requests_from_url': <function Spider.make_requests_from_url at 0x00000000047C91E0>, 'parse': <function Spider.parse at 0x00000000047C9268>, 'update_settings': <classmethod object at 0x0000000003912C88>, 'handles_request': <classmethod object at 0x0000000003E0B7F0>, 'close': <staticmethod object at 0x0000000004756BA8>, '__str__': <function Spider.__str__ at 0x00000000047C9488>, '__repr__': <function Spider.__str__ at 0x00000000047C9488>, '__dict__': <attribute '__dict__' of 'Spider' objects>, '__weakref__': <attribute '__weakref__' of 'Spider' objects>})
user1592380

2 Answers

In the documentation there is an example of how to create a pipeline that writes to MongoDB.

It uses def open_spider(self, spider): to open the database. That method receives the spider as an argument, which gives you access to it, so you can get your variable:

def open_spider(self, spider):

    table = spider.table

So it could be something like this (similar to the code from the documentation):

class DynamicSQLlitePipeline(object):

    def open_spider(self, spider):

        table = spider.table

        try:
            db_path = "sqlite:///"+settings.SETTINGS_PATH+"\\data.db"
            self.db = dataset.connect(db_path)
            table_name = table[0:3]  # FIRST 3 LETTERS
            self.my_table = self.db[table_name]
            # ... rest ...

    def close_spider(self, spider):
        self.db.close()

    def process_item(self, item, spider):
        self.my_table.insert(dict(item))  # dataset tables use insert(), not pymongo's insert_one()
        return item
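As a side note, a pipeline like this only runs if it is enabled in the project settings; a minimal sketch, assuming the pipeline lives in a module such as myproject.pipelines (the module path is an assumption, adjust it to your project):

# settings.py
ITEM_PIPELINES = {
    'myproject.pipelines.DynamicSQLlitePipeline': 300,   # assumed module path
}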
furas

Scrapy arguments are passed dynamically to the spider instance, and they can be used later within the spider through self.

Now, start_requests is not the first place where you can check for the spider arguments; that would of course be the constructor of the Spider instance (but be careful, because Scrapy also passes important arguments into its constructor).

Now, your problem was that from_crawler (and therefore the pipeline constructor) executes before start_requests, so when the pipeline read table it was still getting the class variable, because the assignment to self.table in start_requests hadn't happened yet.

The correct way would be to use getattr(crawler.spider, 'dc') directly, as the spider already got the dc variable from the command line.
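Applied to the pipeline from the question, that could look roughly like this (keeping the question's names and leaving out the database setup):

class DynamicSQLlitePipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # crawler.spider is the spider instance, so the -a dc=... value is already set on it
        table = getattr(crawler.spider, 'dc')
        return cls(table)

    def __init__(self, table):
        self.table = table   # holds the command-line value, not the dummy 'hello'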

eLRuLL
  • Thanks, this answered my question. I did look inside the spider but did not see the spider arguments (Please see my edit above). Do you know where the spider arguments are stored? – user1592380 Dec 27 '17 at 15:38
  • that's the magic of Python. What I see is that you are checking the attributes of the "Class" Spider, which of course doesn't have that variable. The command line arguments are only added to the Spider "instance", so you can check it with `dir(self)` – eLRuLL Dec 27 '17 at 17:09
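A quick plain-Python illustration of that class-versus-instance distinction (Dummy is a made-up name):

class Dummy:
    name = None                      # class attribute, visible in Dummy.__dict__

d = Dummy()
d.dc = 'input_string'                # instance attribute, only on this object

print('dc' in Dummy.__dict__)        # False - the class never had it
print('dc' in vars(d))               # True  - it lives on the instance
print('dc' in dir(d))                # True  - dir() sees instance attributes too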