
I'm working with SQLAlchemy 1.1 and Scrapy. I'm currently using a pipeline to store extracted data in an SQLite table via SQLAlchemy. I'd like to dynamically create a table to accommodate the item being scraped.

My static pipeline element looks like:

from sqlalchemy import create_engine, Column, Integer, MetaData, Table, Text
from sqlalchemy.exc import IntegrityError

class SQLlitePipeline(object):

    def __init__(self):
        # "settings" is my project settings module, which exposes SETTINGS_PATH
        db_path = "sqlite:///" + settings.SETTINGS_PATH + "\\data.db"
        _engine = create_engine(db_path)
        _connection = _engine.connect()
        _metadata = MetaData()
        _stack_items = Table("stack_items", _metadata,
                             Column("id", Integer, primary_key=True),
                             Column("value", Text),
                             Column("value2", Text))
        _metadata.create_all(_engine)
        self.connection = _connection
        self.stack_items = _stack_items

    def process_item(self, item, spider):
        try:
            ins_query = self.stack_items.insert().values(
                value=item['value'],
                value2=item['value2'])
            self.connection.execute(ins_query)
        except IntegrityError:
            print('THIS IS A DUP')
        return item

items.py:

import scrapy

class Filtered_Item(scrapy.Item):

    value = scrapy.Field()
    value2 = scrapy.Field()

How can I modify the pipeline above to dynamically create the table and insert the filtered item's values, instead of having them hard-coded as they are now?

user1592380

3 Answers


There's actually a package out there that can help you out with this.

Check out: dataset: databases for lazy people

Here's an excerpt from the page:

Features

Automatic schema:

If a table or column is written that does not exist in the database, it will be created automatically.

Upserts:

Records are either created or updated, depending on whether an existing version can be found.

Query helpers:

Simple queries such as all rows in a table or all distinct values across a set of columns.

Compatibility:

Being built on top of SQLAlchemy, dataset works with all major databases, such as SQLite, PostgreSQL and MySQL.

Scripted exports:

Data can be exported based on a scripted configuration, making the process easy and replicable.
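
To make the "automatic schema" feature concrete, here's a minimal sketch (the database path, table name, and column names are made up for illustration):

import dataset

db = dataset.connect("sqlite:///data.db")
table = db["stack_items"]  # created on first use if it doesn't exist

# Columns are created automatically from the dict keys.
table.insert({"value": "foo", "value2": "bar"})

# Inserting a dict with a new key adds a new column on the fly.
table.insert({"value": "baz", "value2": "qux", "value3": "also new"})

# Upsert: update the matching row if one exists (keyed on "value"), else insert.
table.upsert({"value": "foo", "value2": "updated"}, ["value"])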

Community
  • 1
  • 1
Alex Luis Arias
  • 1,313
  • 1
  • 14
  • 27

Here's what I came up with based on Alex's dataset recommendation above:

import traceback

import dataset
from sqlalchemy.exc import IntegrityError

class DynamicSQLlitePipeline(object):

    @classmethod
    def from_crawler(cls, crawler):
        # Here, you get whatever value was passed through the "target" parameter
        table_name = getattr(crawler.spider, "target")
        return cls(table_name)

    def __init__(self, table_name):
        try:
            # "settings" is the project settings module exposing SETTINGS_PATH
            db_path = "sqlite:///" + settings.SETTINGS_PATH + "\\data.db"
            db = dataset.connect(db_path)
            # dataset creates the table (and its columns) on demand
            self.my_table = db[table_name]
        except Exception:
            traceback.print_exc()

    def process_item(self, item, spider):
        try:
            self.my_table.insert(dict(item))
        except IntegrityError:
            print('THIS IS A DUP')
        return item

Hope it helps.

user1592380

This is not a direct answer to the question, but an alternative way to approach the problem.

How can I modify the pipeline above to dynamically create and insert the filtered item's values instead of having these hard coded in like they are now?

What I hear is that you don't want a pre-defined table schema and want your database to adjust to the fields you are scraping. This sounds a lot like you need a schemaless database.

Think about switching to MongoDB or other NoSQL schemaless storages. The Scrapy documentation even provides an example of a Python+MongoDB pipeline which would insert a scraped item into a MongoDB collection ("table" in SQL terms) as a JSON document:

def process_item(self, item, spider):
    self.db[self.collection_name].insert(dict(item))
    return item
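
For context, here is a condensed sketch along the lines of the full example from the Scrapy docs (the MONGO_URI / MONGO_DATABASE setting names and the collection name follow that example; adjust them to your project):

import pymongo

class MongoPipeline(object):

    collection_name = 'scrapy_items'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # Connection details come from the project settings
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DATABASE', 'items'),
        )

    def open_spider(self, spider):
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # No schema anywhere: the item dict is stored as-is
        self.db[self.collection_name].insert(dict(item))
        return item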

And, importantly, it does not matter what the item fields are: there is no pre-defined structure for your collection documents.

This is just a thought - I don't know much about your project requirements and possible constraints.

alecxe
  • Thanks for looking at this. I'm trying to create a pipeline object that I could add to any project to load the items into a table with basically zero configuration. I will read the docs and your link. – user1592380 Dec 19 '16 at 15:31