Upsert in Scrapy + PostgreSQL Pipeline?

Question

I have a Scrapy spider that inserts the scraped data into a PostgreSQL database. It works fine on the first run, but if the primary key already exists, I get the corresponding errors

I tried to implement the upsert DO NOTHING or DO UPDATE commands I read about. I even read the Blog from Depesz about why upsert is so complicated - and while it was interesting, it was a bit "high" for me on some parts and I won't get any race problems with this project.

I tried to use this code for my pipline:

 def store_db(self, item):
    if "pos" in item:
    
        self.cur.execute("insert into horses(pos, draw, dwinner, dnext, startnumber, horsename, horsecountry, odds, jockey, trainer, weight, age, headgear, ofr, rp, ts, rprc, date, track, racename) values(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s) ON CONFLICT (pos, draw, dwinner, dnext, startnumber, horsename, horsecountry, odds, jockey, trainer, weight, age, headgear, ofr, rp, ts, rprc, date, track, racename) DO NOTHING",
            (
            item['pos'],
            item['draw'],
            item['dwinner'],
            item['dnext'],
            item['startnumber'],
            item['pferde'],
            item['horsecountry'],
            item['odd'],
            item['jockey'],
            item['trainer'],
            item['weightkg'],
            item['alter'],
            item['headgear'],
            item['officalrating'],
            item['rp'],
            item['ts'],
            item['rprc'],
            item['date'],
            item['track'],
            item['racename']
            )) 
        self.connection.commit()
        return item

But I guess I messed it up. How do I implement the ON CONFLICT command in this pipeline correctly?

****** EDIT ***** In another table, I set date, track and racename as a primary key. If I update the code to only these three

(sel.cur.execute("insert into( date, track, racename) value(%s, %s, %s) ON CONFLICT (date, track, racename) DO NOTHING

it works as intended. But - if I add another, not primary key field like "racetype", I get this error:

      File "C:\Anaconda3\envs\virtual_workspace\lib\site-packages\twisted\internet\defer.py", line 858, in _runCallbacks
    current.result = callback(  # type: ignore[misc]
  File "C:\Anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\utils\defer.py", line 150, in f
    return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
  File "C:\Users\Schalumpa\projects\jsontest\jsontest\pipelines.py", line 30, in process_item
    self.store_db(item)
  File "C:\Users\Schalumpa\projects\jsontest\jsontest\pipelines.py", line 42, in store_db
    item['racetype']
KeyError: 'racetype'

I read here: that "The optional ON CONFLICT clause specifies an alternative action to raising a unique violation or exclusion constraint violation error" - does that mean that I can only use "ON CONFLICT" on Primary keys - and that the rest still gets processed? I thought on "DO NOTHING" the operation on this column will just be aborted, I don`t want to update anything in the tables, as there is no "new" data once it already is insertet once.

I might try this. But why does DO NOTHING work if only the primary keys are present, and breaks if there are more, non-key columns? — Phil, Mar 09 '22 at 08:52

Upsert in Scrapy + PostgreSQL Pipeline?

0 Answers0