0

Could anyone please tell me which one of the following is more efficient? I have tens of millions of rows to process, and performance is critical.

In the second example, table0 is a temporary table, which seems to be much faster to create than table0 in the first example. (Why?) I couldn't use a temporary table in the first example because the variable row could not be declared before the table is created. (table0 holds all distinct rows of the original table, which is not shown in the code below.)

I guess it'd be a good idea to create hash indices for blah2, blah3, blah4 and blah5 of table1 in the second example, though it would then take longer to write onto the table.


FOR row IN SELECT * FROM table0
LOOP
  IF NOT EXISTS (SELECT 1 FROM table1
                   WHERE blah2 = row.blah2 AND blah3 = row.blah3
                     AND blah4 = row.blah4 AND blah5 = row.blah5) THEN
    INSERT INTO table2
      (blah0, blah1, blah2, blah3, blah4, blah5)
      VALUES (row.blah0, row.blah1, row.blah2, row.blah3, row.blah4, row.blah5);
  END IF;
END LOOP;

INSERT INTO table2
  (blah0, blah1, blah2, blah3, blah4, blah5)
  SELECT blah0, blah1, blah2, blah3, blah4, blah5 FROM table0
    WHERE NOT EXISTS
      (SELECT 1 FROM table1
         WHERE table1.blah2 = table0.blah2
           AND table1.blah3 = table0.blah3
           AND table1.blah4 = table0.blah4
           AND talbe1.blah5 = table0.blah5);
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • It's kind of hard to tell exactly what you're asking from this. You seem to be talking about two different `table0`s created in different ways (how? why?). Also, don't use hash indices, there's little advantage over a b-tree and they aren't well maintained, don't support replication, etc. – Craig Ringer Oct 12 '12 at 10:30
  • If comparing performance use `EXPLAIN ANALYZE`. Looking at those two I'd expect the 2nd to be *massively* faster than the 1st, but you should *measure it on a scaled down sample*, not just guess. – Craig Ringer Oct 12 '12 at 10:34
  • I think the questions are clear enough. – Erwin Brandstetter Oct 12 '12 at 17:57

1 Answers1

1

Questions

In the second example, table0 is a temporary table, which seems to be much faster to create than table0 in the first example. (Why?)

Temporary tables are regularly much faster than regular ones as their content is not persisted to disk. You still have minor disk activity when you create one, as system catalogs receive entries.

Performance with temp tables takes a dive as soon as you run out of temporary buffers because the system starts swapping out pages to disk. You can set temp_buffers at the start of your session to provide more RAM for temp tables. More details in this related answer.

If in the end you want to persist the result somewhere, you might as well use a regular table right away in the second example.

The first example is also much slower, because looping though individual rows is regularly much more expensive than set-based operations with SQL-commands.

I couldn't use a temporary table in the first example because the variable row could not be declared before the table is created. (table0 holds all distinct rows of the original table, which is not shown in the code below.)

Well, you can't use the row type in a function before the table is created. But you can just use an anonymous record instead:

DECLARE
   rec record;
BEGIN
   FOR rec IN SELECT * FROM table0 ...

Alternative solution

But your first approach is no good anyway. Your second approach looks just fine. Should be the fastest way. An alternative would be:

INSERT INTO table2 (blah0, blah1, blah2, blah3, blah4, blah5)
SELECT t0.blah0, t0.blah1, t0.blah2, t0.blah3, t0.blah4, t0.blah5
FROM   table0 t0
LEFT   JOIN table1 t1 USING (blah2, blah3, blah4, blah5)
WHERE  t1.t1_id IS NULL -- or any other column defined NOT NULL

If you have duplicate entries for (blah2, blah3, blah4, blah5) in table1 then your query with EXISTS is probably faster. Else this one with LEFT JOIN / IS NULL might come out on top.

Index and test

Of course an index on (blah2, blah3, blah4, blah5) in table1 would help a lot - adding some cost to write operations in table1. Make that a plain b-tree index (like @Craig already advises in the comment), but I would go for a multi-column index. Cheaper and faster for this case!

Use EXPLAIN ANALYZE to test performance as @Craig advised in the comment - or just EXPLAIN first (without ANALYZE), since your tables seem to be huge and EXPLAIN ANALYZE fake-executes it all to get real times.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • Thanks very much, Erwin and Craig. I was trying to make the questions as succinct as possible, but, being a new member, I wasn't sure how to format them nicely. This answer is extremely comprehensive and is exactly what I was hoping to find! Thank you once again for your time! – Vernon Sung Oct 14 '12 at 20:40
  • BTW, since I'm only checking for equalities (not inequalities), would using hash indices be more efficient than using b-tree indices if I ignore maintenance issues? I'm not sure whether that's a good idea, but there seem to be some differences in performance when searching for a row with a matching value. – Vernon Sung Oct 14 '12 at 21:04
  • And I have no duplicate entries for (blah2, blah3, blah4, blah5) in table1 so the alternative solution above may be the fastest, which I guess I should check using EXPLAIN ANALYZE. – Vernon Sung Oct 14 '12 at 21:26
  • @user1740028: As long as you don't deal with huge columns, a b-tree index should be just fine. Make sure the index is used by checking with EXPLAIN. Of course, all the basic advice for performance optimization applies. Check the [Postgres Wiki page](http://wiki.postgresql.org/wiki/Performance_Optimization). It's a little dated by now, but all the basics still apply. – Erwin Brandstetter Oct 14 '12 at 21:44
  • That's a great collection of very useful links. Thank you! – Vernon Sung Oct 14 '12 at 22:00