Access Instance of scrapy pipeline class

Question

I want to access the variable self.cursor to make use of the active PostgreSQL connection, but I am unable to figure out how to access Scrapy's instance of the pipeline class.

import os
import psycopg2


class ScrapenewsPipeline(object):

    def open_spider(self, spider):
        # Open one PostgreSQL connection when the spider starts.
        self.connection = psycopg2.connect(
            host=os.environ['HOST_NAME'],
            user=os.environ['USERNAME'],
            database=os.environ['DATABASE_NAME'],
            password=os.environ['PASSWORD'])
        self.cursor = self.connection.cursor()
        self.connection.set_session(autocommit=True)

    def close_spider(self, spider):
        # Release the cursor and connection when the spider finishes.
        self.cursor.close()
        self.connection.close()

    def process_item(self, item, spider):
        print("Some Magic Happens Here")
        return item

    def checkUrlExist(self, item):
        print("I want to call this function from my spider to access the self.cursor variable")

Please note, I realise I can reach process_item by using yield item, but that function is doing other things. I want access to the connection via self.cursor in checkUrlExist, and I want to be able to call the pipeline instance from my spiders at will! Thank you.

objectName is not known to me. The pipeline class is instantiated automatically when the spider starts; I want to hook into that instance of the class! :) – atb00ker, Dec 04 '17 at 11:28
Maybe you should consider `getattr`: https://stackoverflow.com/questions/4075190/what-is-getattr-exactly-and-how-do-i-use-it#4076099 – saud, Dec 04 '17 at 14:03

score 3 · Accepted Answer · edited Apr 25 '19 at 14:42

Inside a pipeline, you can access any attribute of your spider class as spider.variable_name.

import scrapy

class MySpider(scrapy.Spider):
    name = "myspider"
    any_variable = "any_value"

Your pipeline:

class MyPipeline(object):
    def process_item(self, item, spider):
        value = spider.any_variable  # read the spider's attribute
        return item

I suggest creating the connection in your Spider class, just as I declared any_variable in my example. It will then be accessible in your spider as self.any_variable and in your pipelines as spider.any_variable.
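Here is a minimal, runnable sketch of that suggestion. To stay self-contained it makes two assumptions that are not in the original: plain classes stand in for scrapy.Spider and the Scrapy pipeline, and an in-memory sqlite3 database stands in for the PostgreSQL connection. The table name seen_urls and the method url_exists are made up for illustration.

```python
import sqlite3


class MySpider:
    """Stand-in for a scrapy.Spider subclass (assumption: Scrapy is not imported here)."""
    name = "myspider"

    def __init__(self):
        # In-memory SQLite stands in for the PostgreSQL connection.
        self.connection = sqlite3.connect(":memory:")
        self.cursor = self.connection.cursor()
        self.cursor.execute("CREATE TABLE seen_urls (url TEXT PRIMARY KEY)")

    def url_exists(self, url):
        # Hypothetical helper: the spider queries its own connection at will.
        self.cursor.execute("SELECT 1 FROM seen_urls WHERE url = ?", (url,))
        return self.cursor.fetchone() is not None


class MyPipeline:
    def process_item(self, item, spider):
        # The pipeline reuses the spider's cursor through the `spider` argument.
        spider.cursor.execute(
            "INSERT OR IGNORE INTO seen_urls (url) VALUES (?)", (item["url"],)
        )
        spider.connection.commit()
        return item


spider = MySpider()
pipeline = MyPipeline()
pipeline.process_item({"url": "https://example.com/a"}, spider)
print(spider.url_exists("https://example.com/a"))  # stored by the pipeline
print(spider.url_exists("https://example.com/b"))  # never stored
```

With this layout each spider owns exactly one connection, and both the spider and the pipeline use it.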

answered Dec 03 '17 at 10:53 by Umair Ayub · edited Apr 25 '19 at 14:42 by vezunchik

I have 60 spiders; in this case all of them will have their own PostgreSQL connections. I have only limited RAM, so this will not be useful for me. – atb00ker Dec 04 '17 at 11:26

score 1 · Answer 2 · answered Mar 16 '20 at 12:21

I realize I'm a little late to the party here, but in case anyone is looking for the correct answer to this question: any pipeline or middleware (or, for that matter, downloader etc.) instance can be accessed through the crawler object, which controls everything else. You can access the crawler in a spider by using the from_crawler classmethod to set a .crawler attribute at the time of initialization.

Doing some digging around in the scrapy shell, you should be able to find the instance of any object being used in the current crawl, e.g.:

Spider middlewares: crawler.engine.scraper.spidermw.middlewares
Downloader middlewares: crawler.engine.downloader.middleware.middlewares
Item pipelines: crawler.engine.scraper.itemproc.middlewares (I think so; this is just based on a rudimentary exploration in the scrapy shell)
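As a rough sketch of how the item-pipeline lookup could be wrapped, the hypothetical helper find_pipeline below walks that attribute path and returns the first instance of a given class. It assumes the path listed above is right for your Scrapy version; these are internal attributes and may change. So that the snippet runs without Scrapy, the crawler here is a fake built from SimpleNamespace objects, and the pipeline class is a stand-in for the one in the question.

```python
from types import SimpleNamespace


def find_pipeline(crawler, pipeline_cls):
    """Return the running instance of pipeline_cls, or None if not found.

    Assumes pipeline instances live at
    crawler.engine.scraper.itemproc.middlewares (an internal attribute
    path that may differ between Scrapy versions).
    """
    for instance in crawler.engine.scraper.itemproc.middlewares:
        if isinstance(instance, pipeline_cls):
            return instance
    return None


class ScrapenewsPipeline:
    """Stand-in for the pipeline from the question (no database here)."""

    def checkUrlExist(self, item):
        # Placeholder logic; the real method would query self.cursor.
        return item.get("url") == "https://example.com/seen"


# Fake crawler exposing the same attribute chain, for demonstration only.
pipeline = ScrapenewsPipeline()
fake_crawler = SimpleNamespace(
    engine=SimpleNamespace(
        scraper=SimpleNamespace(
            itemproc=SimpleNamespace(middlewares=[object(), pipeline])
        )
    )
)

found = find_pipeline(fake_crawler, ScrapenewsPipeline)
print(found is pipeline)
print(found.checkUrlExist({"url": "https://example.com/seen"}))
```

In a real spider the call would presumably look like find_pipeline(self.crawler, ScrapenewsPipeline), made from a callback once the crawl is running and the pipelines have been opened.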

Please note that I'm not advocating this as the way to access a database connection object from a spider. My point is only that any Scrapy object instance can be accessed through the crawler object, which is the answer to the OP's question as per the title.
