I have a Scrapy project that uses custom middleware and a custom pipeline to check and store entries in a Postgres DB. The middleware looks a bit like this:

class ExistingLinkCheckMiddleware(object):

    def __init__(self):
        # open connection to database
        ...

    def process_request(self, request, spider):
        # before each request, check in the DB
        # that the page hasn't been scraped before
        ...

The pipeline looks similar:

class MachinelearningPipeline(object):

    def __init__(self):
        # open connection to database
        ...

    def process_item(self, item, spider):
        # save the item to the database
        ...

It works fine, but I can't find a way to cleanly close these database connections when the spider finishes, which irks me.

Does anyone know how to do that?


1 Answer

I think the best way to do it is to use Scrapy's spider_closed signal, e.g.:

from scrapy import signals
from scrapy.xlib.pydispatch import dispatcher

class ExistingLinkCheckMiddleware(object):

    def __init__(self):
        # open connection to database
        ...

        # run self.spider_closed when the spider_closed signal fires
        dispatcher.connect(self.spider_closed, signals.spider_closed)

    def spider_closed(self, spider, reason):
        # close db connection
        ...

    def process_request(self, request, spider):
        # before each request, check in the DB
        # that the page hasn't been scraped before
        ...

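Note that scrapy.xlib.pydispatch was deprecated and later removed in newer Scrapy releases. If you're on a recent version, the equivalent wiring (a sketch along the same lines, not lifted from the original code) goes through the from_crawler classmethod and crawler.signals.connect:

from scrapy import signals

class ExistingLinkCheckMiddleware(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this to build the middleware; it is the
        # documented place to hook up signal handlers
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed,
                                signal=signals.spider_closed)
        return middleware

    def __init__(self):
        # open connection to database
        ...

    def spider_closed(self, spider, reason):
        # close db connection
        ...

    def process_request(self, request, spider):
        # before each request, check in the DB
        # that the page hasn't been scraped before
        ...

With this, Scrapy instantiates the middleware itself, and the handler fires exactly once per spider when the crawl ends.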
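For the pipeline half of your question you don't even need signals: item pipelines can define open_spider and close_spider methods, and Scrapy calls them for you when the spider starts and finishes. A minimal sketch, assuming psycopg2 as the Postgres driver (the connection string below is a placeholder for your own settings):

import psycopg2  # assumption: psycopg2 as the Postgres driver

class MachinelearningPipeline(object):

    def open_spider(self, spider):
        # called once when the spider opens; replace the DSN
        # with your real connection settings
        self.connection = psycopg2.connect("dbname=scrapydb user=scrapy")

    def close_spider(self, spider):
        # called once when the spider closes
        self.connection.close()

    def process_item(self, item, spider):
        # save the item to the database, then pass it on
        return item

That way the connection lives exactly as long as the crawl, with no manual teardown.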
Hope that helps.
