
I already have some working Spiders and code to achieve what I want, but I was looking for advice on how to consolidate things more efficiently for the project I'm working on.

My current process involves:

  • Within Scrapy: manually create Items using scrapy.Item
  • Within Scrapy: crawl, outputting each Item row to a JSON Lines (JL) file

    • Current Pipeline:
    #pipelines.py
    class MyPipeline(object):
        def process_item(self, item, spider):
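            # Ensure every declared field exists on the item, defaulting to None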
            for field in item.fields:
                item.setdefault(field, None)
            return item
    
  • Outside Scrapy w/ SQLAlchemy: truncate incoming table, bulk insert JL file using pandas to_sql
  • Outside Scrapy w/ SQLAlchemy: update incoming table row_uuid column (md5 hash of all pertinent columns)
  • Outside Scrapy w/ SQLAlchemy: upsert (insert...on conflict...where row_uuid is distinct from) incoming data table into source data table (roughly as sketched after this list)
  • Outside Scrapy w/ SQLAlchemy: delete from source table as necessary (404 errors, etc.)
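
For reference, the outside-Scrapy steps above look roughly like this (the connection string, table names, and columns such as col_a/col_b are placeholders):

    #post_process.py - rough sketch of the current outside-Scrapy post-processing
    import pandas as pd
    from sqlalchemy import create_engine, text

    engine = create_engine("postgresql://user:pass@localhost/mydb")

    #truncate the incoming table, then bulk insert the JL file
    df = pd.read_json("items.jl", lines=True)
    with engine.begin() as conn:
        conn.execute(text("TRUNCATE TABLE incoming_table"))
    df.to_sql("incoming_table", engine, if_exists="append", index=False)

    with engine.begin() as conn:
        #row_uuid = md5 hash of the pertinent columns
        conn.execute(text(
            "UPDATE incoming_table SET row_uuid = md5(col_a || '|' || col_b)"))
        #upsert incoming rows into the source table, skipping unchanged rows
        conn.execute(text("""
            INSERT INTO source_table (natural_key, col_a, col_b, row_uuid)
            SELECT natural_key, col_a, col_b, row_uuid FROM incoming_table
            ON CONFLICT (natural_key) DO UPDATE
                SET col_a = EXCLUDED.col_a,
                    col_b = EXCLUDED.col_b,
                    row_uuid = EXCLUDED.row_uuid
                WHERE source_table.row_uuid IS DISTINCT FROM EXCLUDED.row_uuid
        """))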

Ideally, I want to perform all these actions within Scrapy using a proper pipeline. I've seen dataset mentioned. Would some combination of open_spider, process_item, and close_spider help? Some questions I have:

  • Is it possible to populate/define Scrapy Items directly from an existing database table without listing the columns manually?
  • If you have multiple methods in a Spider (parse, parse_detail, etc) that each have their own Item, would the Pipeline be able to insert to the proper database table?
  • Is it possible to bulk insert X Items at a time, rather than one Item at a time?
  • Would the below be a potential approach? I assume other changes would be necessary...

    #pipelines.py
    class SqlPipeline(object):
        def __init__(self, db_conn):
            #Connect to DB here?
            pass

        def open_spider(self, spider):
            #Truncate specific incoming table here?
            pass

        def process_item(self, item, spider):
            #(Bulk) insert item row(s) into specific incoming table?
            #Where would you define the table for this?
            return item

        def close_spider(self, spider):
            #Update row_uuid for specific incoming table?
            #Do upsert and delete rows for specific source table?
            #Close DB connection here?
            pass
    

Thanks for your help!

alpacafondue

1 Answer


Scrapy pipelines are meant to do exactly what you are describing. Answering your questions:

  • Is it possible to populate/define Scrapy Items directly from an existing database table without listing the columns manually?

I'm not sure what you mean by "without listing the columns manually". I am going to guess that you have a table in a database with a bunch of columns. Those columns have to be defined in your Items because they will be mapped to the database. Otherwise, how would you map each field to its column in the table?
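
For example, if the incoming table had columns name, price, url and row_uuid, the Item would simply mirror them (the names here are only illustrative):

    #items.py - Item fields mirroring the columns of a hypothetical incoming table
    import scrapy

    class ProductItem(scrapy.Item):
        name = scrapy.Field()
        price = scrapy.Field()
        url = scrapy.Field()
        row_uuid = scrapy.Field()   #can be filled in later by a pipeline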

  • If you have multiple methods in a Spider (parse, parse_detail, etc) that each have their own Item, would the Pipeline be able to insert to the proper database table?

Yes. You can define multiple pipelines (each with its own order) so that each kind of Item gets its own processing, properly separated.
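
As an illustration (the project, pipeline and Item names are hypothetical), you register each pipeline in settings.py with its order, and each pipeline only handles the Item type it cares about, passing everything else through:

    #settings.py - lower numbers run first
    ITEM_PIPELINES = {
        "myproject.pipelines.ProductSqlPipeline": 300,
        "myproject.pipelines.DetailSqlPipeline": 400,
    }

    #pipelines.py
    from myproject.items import ProductItem

    class ProductSqlPipeline(object):
        def process_item(self, item, spider):
            if not isinstance(item, ProductItem):
                return item               #not ours; pass along untouched
            #insert into the products incoming table here
            return item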

  • Is it possible to bulk insert X Items at a time, rather than one Item at a time? Would the below be a potential approach?

Yes, of course! You have to implement it yourself in your pipeline, and the logic can be different in each one.
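
Here is a minimal sketch along the lines of your skeleton, assuming PostgreSQL via SQLAlchemy: items are buffered and flushed in batches, the incoming table is truncated in open_spider, and the last partial batch plus the row_uuid update are handled in close_spider. The DB_URL setting, the table name and the columns (col_a, col_b) are placeholders you would adapt:

    #pipelines.py - batched insert sketch (table, columns and SQL are placeholders)
    from sqlalchemy import create_engine, text

    class SqlPipeline(object):
        BATCH_SIZE = 500

        def __init__(self, db_url):
            self.db_url = db_url
            self.buffer = []

        @classmethod
        def from_crawler(cls, crawler):
            #read the connection string from settings.py
            return cls(crawler.settings.get("DB_URL"))

        def open_spider(self, spider):
            self.engine = create_engine(self.db_url)
            with self.engine.begin() as conn:
                conn.execute(text("TRUNCATE TABLE incoming_table"))

        def process_item(self, item, spider):
            self.buffer.append(dict(item))
            if len(self.buffer) >= self.BATCH_SIZE:
                self.flush()
            return item

        def flush(self):
            if not self.buffer:
                return
            with self.engine.begin() as conn:
                conn.execute(
                    text("INSERT INTO incoming_table (col_a, col_b) "
                         "VALUES (:col_a, :col_b)"),
                    self.buffer)
            self.buffer = []

        def close_spider(self, spider):
            self.flush()
            with self.engine.begin() as conn:
                #update row_uuid here, then run your upsert/delete against the source table
                conn.execute(text("UPDATE incoming_table "
                                  "SET row_uuid = md5(col_a || '|' || col_b)"))
            self.engine.dispose()

Buffering in process_item keeps the spider from waiting on a round trip per item, and flushing again in close_spider guarantees the final partial batch is written before the engine is disposed.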

ferran87