
So I have recently started working with Postgres, and since I cannot perform an "IF else" except inside a function, I am curious which is better resource-wise and performance-wise.

1. Create a function like the following that queries the database and stores the id, checks whether the id is null, and if it is does the insert, otherwise returns the id:

CREATE OR REPLACE FUNCTION getallfoo() RETURNS int AS
$BODY$
DECLARE
    v_id integer := (SELECT id FROM foo WHERE ... /* lots of condition checks, 15+ */);
BEGIN
    IF v_id IS NULL THEN
        INSERT INTO foo (a, b, c, d, e, f) VALUES (1, 2, 3, 4, 5, 6) RETURNING id INTO v_id;
        RETURN v_id;
    ELSE
        RETURN v_id;
    END IF;
END;
$BODY$
LANGUAGE plpgsql;

SELECT * FROM getallfoo();

Or 2: First attempt an insert into the database, doing something like the following, and then query the database if the insert fails. I have a reason for doing the insert first; that's not the point of this question. I know most would say to query first and then insert if the row doesn't exist.

INSERT INTO foo (a, b, c, d, e, f, ........, 15+)
SELECT 1, 2, 3, 4, 5, 6, ........., 15+
WHERE NOT EXISTS (SELECT 1 FROM foo WHERE a = 1 AND b = 2 AND c = 3 AND ... /* up to 15 */);

Since I'm working in Java, I would just check whether the statement's getGeneratedKeys() result set is empty; if the insert failed I would then re-query the database for the id:

SELECT id FROM foo WHERE a = 1 AND b = 2 AND c = 3 /* etc. */;
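Roughly, the JDBC side would look something like this (a sketch only: the column list is shortened to a, b, c, and the class/method names are just for illustration):

import java.sql.*;

public class FooDao {

    // Sketch of option 2: attempt the conditional INSERT first, then fall back
    // to a SELECT if no row was inserted. Column list shortened to a, b, c.
    public static long getOrInsertFoo(Connection conn, int a, int b, int c) throws SQLException {
        String insert =
            "INSERT INTO foo (a, b, c) " +
            "SELECT ?, ?, ? " +
            "WHERE NOT EXISTS (SELECT 1 FROM foo WHERE a = ? AND b = ? AND c = ?)";
        // Ask the driver to return the generated id column for the inserted row (if any).
        try (PreparedStatement ps = conn.prepareStatement(insert, new String[] { "id" })) {
            ps.setInt(1, a); ps.setInt(2, b); ps.setInt(3, c);
            ps.setInt(4, a); ps.setInt(5, b); ps.setInt(6, c);
            ps.executeUpdate();
            try (ResultSet keys = ps.getGeneratedKeys()) {
                if (keys.next()) {
                    return keys.getLong(1);          // the insert happened
                }
            }
        }
        // The insert was skipped, so the row should already exist: re-query for its id.
        String select = "SELECT id FROM foo WHERE a = ? AND b = ? AND c = ?";
        try (PreparedStatement ps = conn.prepareStatement(select)) {
            ps.setInt(1, a); ps.setInt(2, b); ps.setInt(3, c);
            try (ResultSet rs = ps.executeQuery()) {
                if (!rs.next()) {
                    throw new SQLException("row not found after skipped insert");
                }
                return rs.getLong(1);
            }
        }
    }
}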

So the question is: because the condition checks in my query are lengthy, is it better to create a function and query the database once, or to skip the function and query it twice? I am doing these types of inserts on about 50 different tables.

Side question: can you create a function like the one above from Java using a PreparedStatement, e.g. ps.executeUpdate()?
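For context, a rough sketch of what I have in mind (connection details and the abbreviated function body are placeholders; note executeUpdate() returns an int, so there is no ResultSet to read for the DDL itself):

import java.sql.*;

public class CreateAndCallFunction {
    public static void main(String[] args) throws SQLException {
        // Connection details are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost/testdb", "user", "password")) {

            // The CREATE FUNCTION statement is ordinary DDL, so it can be sent through a
            // PreparedStatement; execute() (or executeUpdate(), which returns 0) is enough.
            String ddl =
                "CREATE OR REPLACE FUNCTION getallfoo() RETURNS int AS $$ " +
                "BEGIN " +
                "    /* body as in the question */ " +
                "    RETURN 0; " +
                "END; $$ LANGUAGE plpgsql";
            try (PreparedStatement create = conn.prepareStatement(ddl)) {
                create.execute();
            }

            // Calling the function afterwards is a normal query returning one row.
            try (PreparedStatement call = conn.prepareStatement("SELECT getallfoo()");
                 ResultSet rs = call.executeQuery()) {
                if (rs.next()) {
                    System.out.println("id = " + rs.getInt(1));
                }
            }
        }
    }
}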

Jeremy
  • In general, querying less means faster execution time. If you can get that performance boost, it is worth it. – Christian Stewart Jul 03 '13 at 17:42
  • Can you elaborate on 'a reason for doing the insert first thats not the point of this question'? IMO this is an anti-pattern... – home Jul 03 '13 at 17:45
  • I'm working with a testbed where I'm the only one inserting data into the database. There are two reasons for checking whether the record already exists: one is that some of my data is in a many-to-many relationship, so the record could already exist, however unlikely. The second is if someone were to attempt to re-parse files containing the data and load it into the database, which won't happen since I'm the only one interacting with the insertion of data into the database. In general it's more likely that the data doesn't exist than that it does. – Jeremy Jul 03 '13 at 17:47
  • Have you looked into the various ways of performing an "upsert" ("update or insert") to see if any can be adapted to your situation? There are various possible solutions on this question: http://stackoverflow.com/questions/1109061/insert-on-duplicate-update-postgresql – IMSoP Jul 03 '13 at 18:07
  • I looked at the upsert, but from what I could tell it required knowing the id going into the insert, and I use a sequence to generate my ids. – Jeremy Jul 03 '13 at 19:57

2 Answers

2

Hitting the database multiple times for the same data is not advisable because it degrades the performance of the application. So if you can perform your task in a single function instead of querying the DB multiple times, then you should use the function.

0

As usual, "it depends".

If your connection to the DB is low latency (local host, say) then the overhead of firing a PL/PgSQL procedure becomes non-trivial and it might be faster to launch separate queries.

Benchmark, don't guess.

In this case I suspect that doing it in PL/PgSQL will win, but not purely for performance reasons. Your code is subject to several race conditions. What happens if two people run "getallfoo()" at the same time?

  1. Tx 1 does INSERT ... WHERE
  2. Tx 2 does INSERT ... WHERE
  3. Tx 1's WHERE condition runs the subquery and finds no rows
  4. Tx 2's WHERE condition runs the subquery and finds no rows
  5. Tx 1 INSERTs the row
  6. Tx 2 INSERTs the row
  7. Tx 1 COMMITs and returns the ID
  8. Tx 2 COMMITs and returns the ID

If you have a UNIQUE constraint on a business key (i.e. not just on a generated primary key), the second INSERT will fail with an error. If you don't, you'll get two copies of the row.

Putting things in a procedure doesn't help; not even a single statement is atomic in execution. Uncorrelated subqueries run before the outer query, etc. MVCC visibility rules usually mean you don't notice or care about this, but in this case they just mean that your race condition exposure is a bit wider.

To handle this correctly you need a re-try loop, like that found in the closely related upsert example for PL/PgSQL in the documentation. You can do this in your application or in a procedure, but doing it in a procedure should narrow the race condition window significantly.
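Purely as an illustration of the application-side variant, a rough JDBC sketch might look like the following. It assumes auto-commit (inside an explicit transaction you'd need a savepoint to recover from the failed INSERT) and a UNIQUE constraint on the business key, shortened here to a single column a; the class and method names are placeholders:

import java.sql.*;

public class UpsertRetry {

    // Application-side retry loop: look the row up, otherwise try to insert it,
    // and if a concurrent transaction wins the race (unique_violation, SQLState 23505)
    // go back and look it up again.
    public static long getOrCreate(Connection conn, int a) throws SQLException {
        while (true) {
            try (PreparedStatement sel = conn.prepareStatement("SELECT id FROM foo WHERE a = ?")) {
                sel.setInt(1, a);
                try (ResultSet rs = sel.executeQuery()) {
                    if (rs.next()) {
                        return rs.getLong(1);   // already there (possibly inserted by someone else)
                    }
                }
            }
            try (PreparedStatement ins =
                     conn.prepareStatement("INSERT INTO foo (a) VALUES (?) RETURNING id")) {
                ins.setInt(1, a);
                try (ResultSet rs = ins.executeQuery()) {
                    rs.next();
                    return rs.getLong(1);       // we inserted it and got the generated id back
                }
            } catch (SQLException e) {
                if (!"23505".equals(e.getSQLState())) {
                    throw e;                    // some other error: don't swallow it
                }
                // Duplicate key: another transaction inserted the same key first; retry the SELECT.
            }
        }
    }
}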

Because you want to do multiple keys at once this is significantly more complicated to handle. You can have races where one transaction is trying to select/insert 1 2 3 4 5 while another wants 4 5 6 7 8. The first tx will successfully insert 1, 2 and 3... then find that the other tx has already inserted 4 and roll back the whole change when it gets a duplicate key error. It won't see that 4 has already been inserted, as the second tx won't have committed yet and the row won't be visible.

I'd recommend doing one key at a time, or having an outer procedure that inserts each key one-by-one using a sub-procedure that does the error handling and the retry loop.

All sound painfully complicated? Yep. PostgreSQL really could use some built-in SQL extensions to help with this.

Craig Ringer