1

I want to access the variable self.cursor to make use of the active PostgreSQL connection, but I am unable to figure out how to access Scrapy's instance of the pipeline class.

import os

import psycopg2


class ScrapenewsPipeline(object):

    def open_spider(self, spider):
        # Open one PostgreSQL connection when the spider starts.
        self.connection = psycopg2.connect(
            host=os.environ['HOST_NAME'],
            user=os.environ['USERNAME'],
            database=os.environ['DATABASE_NAME'],
            password=os.environ['PASSWORD'])
        self.cursor = self.connection.cursor()
        self.connection.set_session(autocommit=True)

    def close_spider(self, spider):
        # Release the cursor and the connection when the spider finishes.
        self.cursor.close()
        self.connection.close()

    def process_item(self, item, spider):
        print("Some Magic Happens Here")
        # Scrapy expects process_item to return the item (or raise DropItem).
        return item

    def checkUrlExist(self, item):
        print("I want to call this function from my spider to access the self.cursor variable")

Please note, I realise I can reach process_item by using yield item, but that function is doing other things. I want access to the connection via self.cursor in checkUrlExist, and I want to be able to call that method on the pipeline instance from my spiders at will. Thank you.

atb00ker
  • `objectName.cursor`? – saud Dec 03 '17 at 07:09
  • objectName is not known to me; the pipeline class is instantiated automatically when the spider starts, and I want to get hold of that instance of the class! :) – atb00ker Dec 04 '17 at 11:28
  • Maybe you should consider `getattr` https://stackoverflow.com/questions/4075190/what-is-getattr-exactly-and-how-do-i-use-it#4076099 – saud Dec 04 '17 at 14:03

2 Answers

3

You can access all of your spider class variables by doing spider.variable_name here.

import scrapy


class MySpider(scrapy.Spider):
    name = "myspider"
    any_variable = "any_value"

Your pipeline:

class MyPipeline(object):
    def process_item(self, item, spider):
        # Read the class attribute defined on MySpider.
        value = spider.any_variable
        return item

I suggest you create the connection in your Spider class, just as I declared any_variable in my example. It will then be accessible in your spider as self.any_variable and in your pipelines as spider.any_variable.
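To make that suggestion concrete, here is a minimal sketch of a spider that owns the connection and a pipeline that reuses it. It rests on several assumptions: plain classes stand in for scrapy.Spider and a real pipeline so the snippet runs without Scrapy, sqlite3 stands in for psycopg2, and the seen table and the url_exists method are hypothetical names that do not appear in the question.

```python
import sqlite3


class MySpider:
    """Stand-in for a scrapy.Spider subclass (assumption: no Scrapy import here)."""
    name = "myspider"

    def __init__(self):
        # Assumption: an in-memory sqlite3 database replaces the PostgreSQL
        # connection so the sketch runs anywhere.
        self.connection = sqlite3.connect(":memory:")
        self.cursor = self.connection.cursor()
        # Hypothetical table used only for this illustration.
        self.cursor.execute("CREATE TABLE seen (url TEXT PRIMARY KEY)")

    def url_exists(self, url):
        # The spider queries through its own cursor.
        self.cursor.execute("SELECT 1 FROM seen WHERE url = ?", (url,))
        return self.cursor.fetchone() is not None


class MyPipeline:
    def process_item(self, item, spider):
        # The pipeline reaches the same cursor through the spider argument.
        spider.cursor.execute(
            "INSERT OR IGNORE INTO seen (url) VALUES (?)", (item["url"],)
        )
        spider.connection.commit()
        return item


spider = MySpider()
pipeline = MyPipeline()
before = spider.url_exists("https://example.com/a")
pipeline.process_item({"url": "https://example.com/a"}, spider)
after = spider.url_exists("https://example.com/a")
```

With this layout each spider holds its own connection, so the number of open connections grows with the number of spiders running at once.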

vezunchik
Umair Ayub
  • I have 60 spiders; in this case all of them will have their own PostgreSQL connections. I have only limited RAM, so this will not prove useful for me. – atb00ker Dec 04 '17 at 11:26
1

I realize I'm a little late to the party here, but in case anyone is looking for the correct answer to this question: any pipeline or middleware (or, for that matter, downloader, etc.) instance can be accessed through the crawler object, which controls everything else. You can access the crawler in a spider by using the from_crawler classmethod to set a .crawler attribute at initialization time.

Digging around in the Scrapy shell, you should be able to find the instance of any object being used in the current crawl, e.g.:

  1. Spider middlewares: crawler.engine.scraper.spidermw.middlewares
  2. Downloader middlewares: crawler.engine.downloader.middleware.middlewares
  3. Item pipelines: crawler.engine.scraper.itemproc.middlewares (I think so; this is just based on a rudimentary exploration in the Scrapy shell)

Please note that I'm not advocating that one should do this to access a database connection object from a spider. I'm only pointing out that any Scrapy object instance can be accessed through the crawler object, which answers the OP's question as per the title.
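As a rough sketch of the item-pipeline path listed above, the lookup could be wrapped in a small helper. Everything here is an assumption beyond the attribute path itself: find_pipeline is a hypothetical helper, the crawler is faked with SimpleNamespace so the snippet runs without Scrapy, and known_urls is a placeholder for the real database check. The path relies on Scrapy internals and may differ between versions.

```python
from types import SimpleNamespace


def find_pipeline(crawler, pipeline_cls):
    """Return the first running pipeline that is an instance of pipeline_cls, else None."""
    # Attribute path taken from the list above; these are internal
    # attributes, so treat the path as an assumption for your Scrapy version.
    for middleware in crawler.engine.scraper.itemproc.middlewares:
        if isinstance(middleware, pipeline_cls):
            return middleware
    return None


class ScrapenewsPipeline:
    def __init__(self):
        # Placeholder for the database lookup done through self.cursor.
        self.known_urls = {"https://example.com/a"}

    def checkUrlExist(self, item):
        return item["url"] in self.known_urls


# Fake crawler that mimics the nesting described above (illustration only).
pipeline = ScrapenewsPipeline()
fake_crawler = SimpleNamespace(
    engine=SimpleNamespace(
        scraper=SimpleNamespace(
            itemproc=SimpleNamespace(middlewares=[object(), pipeline])
        )
    )
)

found = find_pipeline(fake_crawler, ScrapenewsPipeline)
exists = found.checkUrlExist({"url": "https://example.com/a"})
missing = found.checkUrlExist({"url": "https://example.com/b"})
```

In a real spider the call would presumably be find_pipeline(self.crawler, ScrapenewsPipeline) from inside a callback, once the crawl has started and the engine exists.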

krypto07