
I've written a few spiders that pull similar data from different sources. I've also written a pipeline that stores this data in a database. I want to reuse the same pipeline code across multiple spiders, with each spider writing to a different table named dynamically after the spider.

Here is the pipeline.py code:

from sqlalchemy.orm import sessionmaker

# Assuming the model helpers shown below live in models.py alongside this file
from .models import db_connect, create_output_table, Products


class DbPipeline(object):

    def __init__(self):
        """
        Initialises database connection and sessionmaker.
        Creates table if it doesn't exist.
        """
        engine = db_connect()
        create_output_table(engine)
        self.Session = sessionmaker(bind=engine)

    def process_item(self, item, spider):
        """
        Saves scraped products in the database.
        """
        if not self.check_item_exists(item):
            session = self.Session()
            product = Products(**item)
            try:
                session.add(product)
                session.commit()
            except:
                session.rollback()
                raise
            finally:
                session.close()

        return item

    def check_item_exists(self, item):
        """
        Returns True if a product with the same title is already stored.
        """
        session = self.Session()
        try:
            result = session.query(Products).filter(Products.title == item['title']).first()
        finally:
            session.close()

        return result is not None
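
For completeness, the pipeline is wired up the usual Scrapy way in settings.py; the project path here is a placeholder, not my actual project name:

ITEM_PIPELINES = {
    'myproject.pipelines.DbPipeline': 300,
}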

And here is the models.py file:

from sqlalchemy import create_engine, Column, Integer, String
from sqlalchemy.engine.url import URL
from sqlalchemy.ext.declarative import declarative_base

import settings  # the project settings module that defines DATABASE

DeclarativeBase = declarative_base()


def create_output_table(engine):
    DeclarativeBase.metadata.create_all(engine)


def db_connect():
    """
    Connects to the database using the settings defined in settings.py.
    Returns an SQLAlchemy engine instance.
    """
    return create_engine(URL(**settings.DATABASE))


class Products(DeclarativeBase):
    """SQLAlchemy table model"""
    __tablename__ = "name"

    id = Column(Integer, primary_key=True)
    title = Column('title', String(200))
    price = Column('price', String(10), nullable=True)
    url = Column('url', String(200), nullable=True)
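
For reference, URL(**settings.DATABASE) expects DATABASE to be a dict of connection arguments for SQLAlchemy's URL class; mine looks something like this (all values are placeholders):

DATABASE = {
    'drivername': 'postgresql',
    'host': 'localhost',
    'port': '5432',
    'username': 'scraper',
    'password': 'secret',
    'database': 'products_db',
}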

What I'm trying to do is make the __tablename__ variable match the spider name. I can do this easily in the process_item function, since it is passed a spider object and I can read spider.name and assign it to a class variable; however, that function only runs after the table has been created/defined. How can I get the spider name outside of the process_item function in the pipelines.py file?
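
For example, spider.name is right there in process_item, but by that point the table has already been created with the hard-coded name:

def process_item(self, item, spider):
    # spider.name is available here...
    Products.__tablename__ = spider.name
    # ...but Products was already mapped and the table already created
    # in __init__, so this assignment comes too late to change anything.
    return item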

Edit: I've tried the solutions listed in How to access scrapy settings from item Pipeline, but access to the settings doesn't give me the attributes of the spider that is currently running. I need to dynamically get the name of whichever spider is running the pipeline. Thanks

JHam
    Possible duplicate of [How to access scrapy settings from item Pipeline](https://stackoverflow.com/questions/14075941/how-to-access-scrapy-settings-from-item-pipeline) – gangabass Aug 23 '18 at 00:54
  • @gangabass hey, I've updated my question to explain why it's not a duplicate – JHam Aug 23 '18 at 02:37
  • Do you want to get the current spider name in your `check_item_exists`? – gangabass Aug 23 '18 at 03:01
  • If I could just get that name anywhere in that file it would be something, but I really need it inside the Products class. I'm currently trying out the from_crawler() method from the post you linked, but I'm not entirely sure it will help. I know crawler.spider.name might work, but I'm not sure whether it will just return the DefaultSpider name rather than the currently running one. – JHam Aug 23 '18 at 03:20

1 Answer

It's pretty easy to get the current spider name into your create_output_table:

class DbPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.spider.name)

    def __init__(self, spider_name):
        """
        Initializes database connection and sessionmaker.
        Creates the output table for this spider.
        """
        engine = db_connect()
        create_output_table(engine, spider_name)
        ......

and (in models.py):

def create_output_table(engine, spider_name):
    # now you have your spider_name
    DeclarativeBase.metadata.create_all(engine)

The problem here is that your models.py is processed (and the Products model defined) before your pipelines.py runs. So you need to find a way to generate your SQLAlchemy model later, at runtime. You can use this thread as a starting point: Dynamically setting __tablename__ for sharding in SQLAlchemy?
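
A minimal sketch of that idea, reusing the imports and helpers (DeclarativeBase, db_connect, sessionmaker) shown above; the factory name create_products_model is mine, and you'd remove the static Products class from models.py so its hard-coded table isn't created too:

def create_products_model(spider_name):
    # Build the model at runtime so __tablename__ can come from the
    # spider that is actually running.
    class Products(DeclarativeBase):
        __tablename__ = spider_name
        # Guard against the same table being defined twice in one process
        __table_args__ = {'extend_existing': True}

        id = Column(Integer, primary_key=True)
        title = Column('title', String(200))
        price = Column('price', String(10), nullable=True)
        url = Column('url', String(200), nullable=True)

    return Products

and in the pipeline:

class DbPipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        return cls(crawler.spider.name)

    def __init__(self, spider_name):
        engine = db_connect()
        self.Products = create_products_model(spider_name)
        DeclarativeBase.metadata.create_all(engine)
        self.Session = sessionmaker(bind=engine)

process_item and check_item_exists would then use self.Products instead of the module-level Products.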

gangabass