I have a Scrapy spider that inserts the scraped data into a PostgreSQL database. It works fine on the first run, but if the primary key already exists, I get the corresponding errors
I tried to implement the upsert DO NOTHING or DO UPDATE commands I read about. I even read the Blog from Depesz about why upsert is so complicated - and while it was interesting, it was a bit "high" for me on some parts and I won't get any race problems with this project.
I tried to use this code for my pipline:
def store_db(self, item):
if "pos" in item:
self.cur.execute("insert into horses(pos, draw, dwinner, dnext, startnumber, horsename, horsecountry, odds, jockey, trainer, weight, age, headgear, ofr, rp, ts, rprc, date, track, racename) values(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s) ON CONFLICT (pos, draw, dwinner, dnext, startnumber, horsename, horsecountry, odds, jockey, trainer, weight, age, headgear, ofr, rp, ts, rprc, date, track, racename) DO NOTHING",
(
item['pos'],
item['draw'],
item['dwinner'],
item['dnext'],
item['startnumber'],
item['pferde'],
item['horsecountry'],
item['odd'],
item['jockey'],
item['trainer'],
item['weightkg'],
item['alter'],
item['headgear'],
item['officalrating'],
item['rp'],
item['ts'],
item['rprc'],
item['date'],
item['track'],
item['racename']
))
self.connection.commit()
return item
But I guess I messed it up. How do I implement the ON CONFLICT command in this pipeline correctly?
****** EDIT ***** In another table, I set date, track and racename as a primary key. If I update the code to only these three
(sel.cur.execute("insert into( date, track, racename) value(%s, %s, %s) ON CONFLICT (date, track, racename) DO NOTHING
it works as intended. But - if I add another, not primary key field like "racetype", I get this error:
File "C:\Anaconda3\envs\virtual_workspace\lib\site-packages\twisted\internet\defer.py", line 858, in _runCallbacks
current.result = callback( # type: ignore[misc]
File "C:\Anaconda3\envs\virtual_workspace\lib\site-packages\scrapy\utils\defer.py", line 150, in f
return deferred_from_coro(coro_f(*coro_args, **coro_kwargs))
File "C:\Users\Schalumpa\projects\jsontest\jsontest\pipelines.py", line 30, in process_item
self.store_db(item)
File "C:\Users\Schalumpa\projects\jsontest\jsontest\pipelines.py", line 42, in store_db
item['racetype']
KeyError: 'racetype'
I read here: that "The optional ON CONFLICT clause specifies an alternative action to raising a unique violation or exclusion constraint violation error" - does that mean that I can only use "ON CONFLICT" on Primary keys - and that the rest still gets processed? I thought on "DO NOTHING" the operation on this column will just be aborted, I don`t want to update anything in the tables, as there is no "new" data once it already is insertet once.