
I would like to insert a record into a table and if the record is already present get its id, otherwise run the insert and get the new record's id.

I will be inserting millions of records and have no idea how to do this in an efficient manner. What I am doing now is to run a SELECT to check if the record is already present, and if not, insert it and get the inserted record's id. As the table grows, I imagine that SELECT is going to kill me.

What I am doing now in python with psycopg2 looks like this:

# Check whether the record is already there (query text and params elided).
select = ("SELECT id FROM ... WHERE ...", [...])
cur.execute(*select)
if not cur.rowcount:
    # Not found: insert it and have Postgres hand back the new id.
    insert = ("INSERT INTO ... VALUES ... RETURNING id", [...])
    cur.execute(*insert)
rid = cur.fetchone()[0]

Is it maybe possible to do something in a stored procedure like this:

BEGIN
    EXECUTE sql_insert;
    RETURN id;
    EXCEPTION WHEN unique_violation THEN
        -- return id of already existing record
        -- from the exception info ?
END;
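
Fleshed out, the sketch could look like this (mytbl, col and id are placeholder names, assuming a single UNIQUE column and an integer id; note that the exception itself does not carry the existing row's id, so a lookup is still needed in the handler):

```sql
-- Placeholder names mytbl / col / id; assumes one UNIQUE column.
CREATE OR REPLACE FUNCTION insert_or_get(_col text)
  RETURNS integer AS
$$
DECLARE
   _id integer;
BEGIN
   INSERT INTO mytbl(col) VALUES (_col)
   RETURNING id INTO _id;
   RETURN _id;
EXCEPTION WHEN unique_violation THEN
   -- row already there: the exception info has no id, so look it up
   SELECT id INTO _id FROM mytbl WHERE col = _col;
   RETURN _id;
END
$$ LANGUAGE plpgsql;
```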

Any ideas on how to optimize a case like this?

Erwin Brandstetter
Cricri
    It appears that you are attempting to implement an upsert. See http://www.depesz.com/2012/06/10/why-is-upsert-so-complicated/ – Craig Ringer Nov 09 '12 at 14:24
  • possible duplicate of [UPDATE/INSERT based on if a row exists](http://stackoverflow.com/questions/11559420/update-insert-based-on-if-a-row-exists) – Craig Ringer Nov 09 '12 at 14:24
  • 2
    I do not believe it is an upsert. I am not doing any updates. I have millions of duplicate records coming and I need the id of the record if it's already in the database. – Cricri Nov 09 '12 at 14:33
  • For some background: I have a cartesian product of values of ~ 40 mill records which I would like to split into 3 tables. The split causes a lot of duplication. As I am inserting, I will have all duplicates removed. But I would still like to keep the original join of the 3 tables in a 4th table. The join will be on the id of records in the 3 tables. – Cricri Nov 09 '12 at 14:43
  • My mistake, it's not an upsert. It has similar problems, though. – Craig Ringer Nov 10 '12 at 01:24

1 Answer


First off, this is obviously not an UPSERT as UPDATE was never mentioned. Similar concurrency issues apply, though.

There will always be a race condition for this kind of task, but you can minimize it to an extremely tiny time slot, while at the same time querying for the ID only once with a data-modifying CTE (introduced with PostgreSQL 9.1):

Given a table tbl:

CREATE TABLE tbl(tbl_id serial PRIMARY KEY, some_col text UNIQUE);

Use this query:

WITH x AS (SELECT 'baz'::text AS some_col)  -- enter value(s) once
, y AS (
   SELECT x.some_col
        , (SELECT t.tbl_id FROM tbl t WHERE t.some_col = x.some_col) AS tbl_id
   FROM   x
   )
, z AS (
   INSERT INTO tbl(some_col)
   SELECT y.some_col
   FROM   y
   WHERE  y.tbl_id IS NULL
   RETURNING tbl_id
   )
SELECT COALESCE(
         (SELECT tbl_id FROM z)
       , (SELECT tbl_id FROM y)
       );
  • CTE x is only for convenience: enter values once.
  • CTE y retrieves tbl_id - if it already exists.
  • CTE z inserts the new row - if it doesn't.
  • The final SELECT uses COALESCE to avoid running another query on the table.

Now, this can still fail if a concurrent transaction commits a new row with some_col = 'baz' exactly between CTE y and z, but that's extremely unlikely. If it happens you get a duplicate key violation and have to retry. Nothing lost. If you don't face concurrent writes, you can just forget about this.

You can put this into a plpgsql function and rerun the query on duplicate key error automatically.
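
A sketch of such a wrapper, assuming the table tbl from above (the function name f_tbl_id is made up for this example):

```sql
-- Retry wrapper around the CTE query; assumes tbl from the CREATE TABLE above.
CREATE OR REPLACE FUNCTION f_tbl_id(_some_col text)
  RETURNS integer AS
$$
DECLARE
   _id integer;
BEGIN
   LOOP
      BEGIN
         WITH y AS (
            SELECT t.tbl_id FROM tbl t WHERE t.some_col = _some_col
            )
         , z AS (
            INSERT INTO tbl(some_col)
            SELECT _some_col
            WHERE  NOT EXISTS (SELECT 1 FROM y)
            RETURNING tbl_id
            )
         SELECT COALESCE((SELECT tbl_id FROM z), (SELECT tbl_id FROM y))
         INTO   _id;
         RETURN _id;
      EXCEPTION WHEN unique_violation THEN
         -- lost the race to a concurrent insert: loop and try again
      END;
   END LOOP;
END
$$ LANGUAGE plpgsql;
```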

It goes without saying that you need two constraints in this setup (as displayed in my CREATE TABLE statement above):

  • a UNIQUE or PRIMARY KEY constraint on tbl_id (which is of serial type!)
  • another UNIQUE or PRIMARY KEY constraint on some_col

Both implement an index automatically.
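
Aside for readers on PostgreSQL 9.5 or later (released after this answer was written): INSERT ... ON CONFLICT can do this in a single statement. The DO UPDATE form with a no-op assignment is needed because ON CONFLICT DO NOTHING ... RETURNING returns no row for a pre-existing conflict:

```sql
INSERT INTO tbl (some_col)
VALUES ('baz')
ON CONFLICT (some_col) DO UPDATE
SET    some_col = EXCLUDED.some_col  -- no-op update so RETURNING sees the row
RETURNING tbl_id;
```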

Erwin Brandstetter