I already have some working Spiders and code to achieve what I want, but I was looking for advice on how to consolidate things more efficiently for the project I'm working on.
My current process involves:
- Within Scrapy: manually create Items using `scrapy.Item`
- Within Scrapy: crawl, outputting each Item row to a JSON Lines (JL) file
- Current Pipeline:

```python
# pipelines.py
class MyPipeline(object):
    def process_item(self, item, spider):
        for field in item.fields:
            item.setdefault(field, None)
        return item
```
- Outside Scrapy w/ SQLAlchemy: truncate incoming table, bulk insert the JL file using Pandas `to_sql`
- Outside Scrapy w/ SQLAlchemy: update incoming table `row_uuid` column (md5 hash of all pertinent columns)
- Outside Scrapy w/ SQLAlchemy: upsert (`insert ... on conflict ... where row_uuid is distinct from`) incoming data table into source data table
- Outside Scrapy w/ SQLAlchemy: `delete` from source table as necessary (404 errors, etc.)
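To make the `row_uuid` and upsert steps concrete, here is a minimal sketch of how I currently approach them; the table and column names below are illustrative placeholders, and the upsert SQL is PostgreSQL-flavored:

```python
import hashlib

def row_uuid(row, columns):
    """md5 hash of the pertinent columns, used to detect changed rows."""
    joined = "|".join(str(row.get(col, "")) for col in columns)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()

# PostgreSQL-style upsert from the incoming table into the source table;
# rows whose row_uuid is unchanged are left untouched.
UPSERT_SQL = """
INSERT INTO source_table (id, name, row_uuid)
SELECT id, name, row_uuid FROM incoming_table
ON CONFLICT (id) DO UPDATE
    SET name = EXCLUDED.name,
        row_uuid = EXCLUDED.row_uuid
    WHERE source_table.row_uuid IS DISTINCT FROM EXCLUDED.row_uuid
"""

print(row_uuid({"id": 1, "name": "widget"}, ["id", "name"]))
```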
Ideally, I want to perform all these actions within Scrapy using a proper pipeline. I've seen `dataset` mentioned. Would some combination of `open_spider`, `process_item`, and `close_spider` help? Some questions I have:
- Is it possible to populate/define Scrapy Items directly from an existing database table without listing the columns manually?
- If you have multiple methods in a Spider (parse, parse_detail, etc.) that each yield their own Item, would the Pipeline be able to insert into the proper database table?
- Is it possible to bulk insert X Items at a time, rather than one Item at a time?
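For the last two questions, one pattern I've been considering is mapping each Item class to a table and buffering rows per table, flushing every N rows from `process_item` and flushing the remainder in `close_spider`. A sketch (class names, table names, and the batch size are hypothetical):

```python
class BatchBuffer:
    """Accumulate rows per table and flush them in batches via a supplied
    callback (e.g. an SQLAlchemy bulk insert: conn.execute(table.insert(), rows))."""

    def __init__(self, flush_fn, batch_size=100):
        self.flush_fn = flush_fn
        self.batch_size = batch_size
        self.buffers = {}

    def add(self, table, row):
        rows = self.buffers.setdefault(table, [])
        rows.append(row)
        if len(rows) >= self.batch_size:
            self.flush(table)

    def flush(self, table):
        rows = self.buffers.pop(table, [])
        if rows:
            self.flush_fn(table, rows)

    def flush_all(self):
        for table in list(self.buffers):
            self.flush(table)

# In a pipeline, routing could key off the item's class, e.g.:
# TABLE_MAP = {ListingItem: "incoming_listings", DetailItem: "incoming_details"}
# def process_item(self, item, spider):
#     self.buffer.add(TABLE_MAP[type(item)], dict(item))
#     return item
# ...and close_spider() would call self.buffer.flush_all().
```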
Would the below be a potential approach? I assume other changes would be necessary...
```python
# pipelines.py
class SqlPipeline(object):
    def __init__(self, db_conn):
        # Connect to DB here?
        pass

    def open_spider(self, spider):
        # Truncate specific incoming table here?
        pass

    def process_item(self, item, spider):
        # (Bulk) insert item row(s) into specific incoming table?
        # Where would you define the table for this?
        return item

    def close_spider(self, spider):
        # Update row_uuid for specific incoming table?
        # Do upsert and delete rows for specific source table?
        # Close DB connection here?
        pass
```
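Here is a self-contained sketch of how those hooks might fit together. It uses stdlib `sqlite3` so it runs anywhere; an SQLAlchemy engine/connection would slot into the same places. Table and column names are illustrative assumptions, not part of my actual schema:

```python
import hashlib
import sqlite3

class SqlPipeline:
    """Sketch of the skeleton above: truncate on open, insert per item,
    then hash + upsert on close. Tables/columns here are placeholders."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Connect, ensure tables exist, and empty the incoming table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS incoming (id INTEGER PRIMARY KEY, name TEXT, row_uuid TEXT)")
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS source (id INTEGER PRIMARY KEY, name TEXT, row_uuid TEXT)")
        self.conn.execute("DELETE FROM incoming")  # SQLite has no TRUNCATE

    def process_item(self, item, spider):
        # One row per item; a batched variant would buffer and use executemany().
        self.conn.execute(
            "INSERT INTO incoming (id, name) VALUES (:id, :name)", dict(item))
        return item

    def close_spider(self, spider):
        # Set row_uuid (md5 of the pertinent columns), then upsert into source.
        for row_id, name in self.conn.execute("SELECT id, name FROM incoming").fetchall():
            digest = hashlib.md5(f"{row_id}|{name}".encode()).hexdigest()
            self.conn.execute(
                "UPDATE incoming SET row_uuid = ? WHERE id = ?", (digest, row_id))
        # "WHERE true" disambiguates the SELECT from the ON CONFLICT clause;
        # IS NOT is SQLite's equivalent of IS DISTINCT FROM.
        self.conn.execute("""
            INSERT INTO source (id, name, row_uuid)
            SELECT id, name, row_uuid FROM incoming WHERE true
            ON CONFLICT (id) DO UPDATE
                SET name = excluded.name, row_uuid = excluded.row_uuid
                WHERE source.row_uuid IS NOT excluded.row_uuid
        """)
        self.conn.commit()
        self.conn.close()
```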
Thanks for your help!