0

I'm using PostgreSQL and have a table, named table_all, with 5 million lines and this structure:

id|source_id|destination_id|some_flag
1 | 12      |  13          |  NULL
2 | 12      |  14          |  NULL
...

And another table table_flag with the same number of rows and the following structure:

id|some_flag
1 | true
2 | false
...

I need to set some_flag in table_all with the values from table_flag. If I issue a statement like:

UPDATE table_all set some_flag = table_flag.some_flag 
from table_flag 
where table_all.id = table_flag.id

it takes hours, and I need to perform this operation quite often. The two tables are ordered by id, which is unique, and both tables have all ids. Intuitively for me, setting some_flag in table_all with the values from table_flag is only a matter of performing a bulk copy from table_flag to table_all. Is there a way to do this?

Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • There's no way to do this as a bulk copy, because the values of `some_flag` aren't stored consecutively. Moreover, Postgres (and most other databases) will have dead tuples in the table. For performance-related questions, please run `EXPLAIN ANALYZE` (http://www.postgresql.org/docs/9.2/interactive/sql-explain.html) and post the output. Since that will take hours (it runs the query), you might start with simple `EXPLAIN` that just shows the plan. This query should run in less than several hours. – Andrew Lazarus May 26 '14 at 22:54
  • How do you know the two tables are ordered by id? Even if they start out that way, they will not stay that way for long if you do a lot of these updates. – jjanes May 27 '14 at 19:10
  • Are targeted rows in `table_all` always NULL as your example suggests? But nothing in your text ... If you perform this often, I assume only a few rows in table_flag have changed? Which rows? Can you define that? Please ***edit*** your question to clarify. – Erwin Brandstetter May 28 '14 at 05:04

3 Answers3

0

If the flag is indeed a boolean column, I'd suggest using a combo of default values and sub-queries, like:

UPDATE table_all SET some_flag = True WHERE id IN (SELECT id FROM table_flag WHERE some_flag=True)

EDIT: Deleting the statement "A non-clustered index on (id,some_flag) on table_all might help." because as Andrew pointed out, indexing a bool column is not a great idea.

Ram RS
  • 278
  • 3
  • 17
  • 1
    Adding a `boolean` to an index is generally not helpful. Indexes work best when the underlying column has a large number of values. – Andrew Lazarus May 26 '14 at 22:46
  • Thank you, Andrew. I never thought of it in those terms. Yes, it is not a great idea indeed. I'll edit my answer and update this piece. – Ram RS May 27 '14 at 02:18
0

If your updates are incremental (e.g. most of flags in the main table are not null), use this query:

update table_all 
set some_flag = table_flag.some_flag 
from table_flag 
where 
    table_all.id = table_flag.id
    and (table_all.some_flag isnull
        or table_all.some_flag <> table_flag.some_flag);

If all (or main part of) flags in the main table are null, you can try to use a cursor:

do $$
declare  
    cur cursor for select id, some_flag from table_flag;
    rec record;
begin
    for rec in cur loop
        update table_all 
        set some_flag = rec.some_flag 
        where 
            table_all.id = rec.id
            and (table_all.some_flag isnull
                or table_all.some_flag <> rec.some_flag);
    end loop;
end $$;
klin
  • 112,967
  • 15
  • 204
  • 232
0

Prevent "empty updates" that do not actually change anything, but are just as expensive, because Postgres need to insert a new row version anyway.

UPDATE table_all a
SET    some_flag = f.some_flag 
FROM   table_flag f
WHERE  a.id = f.id
AND    a.some_flag IS DISTINCT FROM f.some_flag;

For columns that can be NULL, use IS DISTINCT FROM.
For columns defined NOT NULL <> is good enough.

Also, there is no "order" in a database table, you have a basic misconception there. Indexes can help a lot, but not when you have to process all rows anyway.

You should probably change your workflow to be able to select the few rows to be updated to begin with. Your current approach is extremely inefficient, even with the improved query.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228