Postgresql batch insert or ignore

Question

I have the responsibility of switching our code from sqlite to postgres. One of the queries I am having trouble with is copied below.

INSERT INTO group_phones(group_id, phone_name)
SELECT g.id, p.name 
FROM phones AS p, groups as g
WHERE g.id IN ($add_groups) AND p.name IN ($phones);

The problem arises when there is a duplicate record. In this table the combination of both values must be unique. I have used a few plpgsql functions in other places to do update-or-insert operations, but in this case I can do several inserts at once. I am not sure how to write a stored routine for this. Thanks for all the help from all the sql gurus out there!

score 12 · Accepted Answer · edited Jun 20 '20 at 09:12

There are 3 challenges.

Your query has no JOIN condition between the tables phones and groups, making this effectively a limited CROSS JOIN - which you most probably do not intend. I.e. every phone that qualifies is combined with every group that qualifies. If you have 100 phones and 100 groups that's already 10,000 combinations.
Insert distinct combinations of (group_id, phone_name)
Avoid inserting rows that are already there in table group_phones .

All things considered it could look like this:

INSERT INTO group_phones(group_id, phone_name)
SELECT i.id, i.name
FROM  (
    SELECT DISTINCT g.id, p.name -- get distinct combinations
    FROM   phones p
    JOIN   groups g ON ??how are p & g connected??
    WHERE  g.id IN ($add_groups)
    AND    p.name IN ($phones)
    ) i
LEFT   JOIN group_phones gp ON (gp.group_id, gp.phone_name) = (i.id, i.name)
WHERE  gp.group_id IS NULL  -- avoid duping existing rows

Concurrency

This form minimizes the chance of a race condition with concurrent write operations. If your table has heavy concurrent write load, you may want to lock the table exclusively or use serializable transaction isolation, This safeguard against the extremely unlikely case that a row is altered by a concurrent transaction in the tiny time slot between the constraint verification (row isn't there) and the write operation in the query.

BEGIN ISOLATION LEVEL SERIALIZABLE;
INSERT ...
COMMIT;

Be prepared to repeat the transaction if it rolls back with a serialization error. For more on that topic good starting points could be this blog post by @depesz or this related question on SO.

Normally, though, you needn't even bother with any of this.

Performance

LEFT JOIN tbl ON right_col = left_col WHERE right_col IS NULL

is generally the fastest method with distinct columns in the right table. If you have dupes in the column (especially if there are many),

WHERE NOT EXISTS (SELECT 1 FROM tbl WHERE right_col = left_col)

May be faster because it can stop to scan as soon as the first row is found.

You can also use IN, like @dezso demonstrates, but it is usually slower in PostgreSQL.

This works! Amazing to use a left join, and a where condition that looks for nulls in the left table! I'd give extra points if I could for not needing PL/SQL. — jlunavtgrad, Aug 10 '12 at 13:21
Nitpick: it's the *right* table that is checked for `NULL` and the procedural language is `PL/pgSQL` or simply `plpgsql`. `PL/SQL` is Oracle's beast. I added an alternative solution for reference. — Erwin Brandstetter, Aug 10 '12 at 14:19
Thanks for the clarification. It makes sense that joined tables come in on the right. I'll add that I think plpgsql is the worst name they could have chosen. — jlunavtgrad, Aug 10 '12 at 15:54
@jlunavtgrad: They certainly didn't consult anyone with the first idea about advertising. :) — Erwin Brandstetter, Aug 10 '12 at 16:11

dezso · Answer 2 · 2012-08-10T14:00:27.140

2

Try the following:

INSERT INTO group_phones(group_id, phone_name)
SELECT DISTINCT g.id, p.name 
FROM phones AS p, groups as g
WHERE 
    g.id IN ($add_groups) 
    AND p.name IN ($phones)
    AND (g.id, p.name) NOT IN (
        SELECT group_id, phone_name
        FROM group_phones
    )
;

With DISTINCT you can be sure that unique rows will be inserted, and with the NOT IN clause you exclude already existing rows.

NOTE While this solution is probably easier to understand, in most cases Erwin's will perform better.

edited Aug 10 '12 at 14:00

answered Aug 09 '12 at 21:09

dezso

2,894
23
29

I checked this out and it works too. I was trying to come up with something like this but I couldn't figure out the syntax. You make it look easy. – jlunavtgrad Aug 10 '12 at 13:56
One small note, I had to change the last from-clause to group_phones for it to work. – jlunavtgrad Aug 10 '12 at 13:58

Postgresql batch insert or ignore

2 Answers2

Concurrency

Performance

Linked