
I'm currently analyzing the Bitcoin blockchain. I have the following tables in my PostgreSQL database:

  • tx: 11GB
  • txin: 40GB
  • txout: 37GB

To speed up my analysis queries, I copy columns between tables. There are indexes on tx.id, txout.tx_id, and txout.txout_id. To copy the column txout_id from table txout to table tx, I executed the following queries:

First query:

    UPDATE tx
    SET txout_id = txout.txout_id
    FROM tx as t1
    INNER JOIN txout
    ON t1.id = txout.tx_id

Second query:

    SELECT tx.*, txout.txout_id 
    INTO tx_txoutID
    FROM tx
    INNER JOIN txout
    ON tx.id = txout.tx_id

I cancelled the first query after 75 minutes. The second query finished after 20 minutes. The second query requires that I drop tx, rename tx_txoutID, and recreate the indexes on tx afterwards. Now I'm wondering whether there's a query that's as fast as the second and as convenient as the first.
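For reference, the table-swap that the second approach requires can be sketched as follows. The index names below are illustrative placeholders, not taken from the actual schema:

    -- Replace tx with the freshly built table, then recreate the indexes.
    -- Index names are hypothetical.
    BEGIN;
    DROP TABLE tx;
    ALTER TABLE tx_txoutID RENAME TO tx;
    CREATE INDEX tx_id_idx ON tx (id);
    COMMIT;

Doing the swap inside a single transaction keeps the table available under its old name until the commit.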

Edit: txout_id is originally not a part of tx, but has been added for the first statement by altering the table.
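The column addition mentioned in the edit would look something like this; the bigint type is an assumption chosen to match a typical txout.txout_id column:

    -- Add the target column before running the UPDATE (type is assumed).
    ALTER TABLE tx ADD COLUMN txout_id bigint;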

Juergen
    Why do you use tx twice? UPDATE tx SET txout_id = txout.txout_id FROM txout WHERE t1.id = txout.tx_id; – Frank Heikens Dec 01 '14 at 12:08
  • I'd suspect that one of the tables doesn't have an applicable index. – Clockwork-Muse Dec 01 '14 at 12:21
  • 2
    This is a common misconception, as @FrankHeikens pointed out: if you mention the currently updated table in the `WHERE` clause, you create a (mostly unintentional) self-join: http://www.postgresql.org/docs/current/static/sql-update.html#AEN81996 (see parameter at `from_list`) – pozs Dec 01 '14 at 12:34
  • @Frank Heikens: Huh, it never occurred to me that you don't need a JOIN when updating from another table. I just tested it and the query has been running for 35 minutes. I believe that eventually this is faster than my first query but still way slower than the SELECT INTO JOIN query. – Juergen Dec 01 '14 at 13:26
  • 1
    Could you show us the results from EXPLAIN ANALYZE? And what is the setting for work_mem? Tuning work_mem could help in this large update, same thing for the checkpoint parameters like checkpoint_segments. – Frank Heikens Dec 01 '14 at 13:38
  • I cancelled the query after 65 minutes. Does EXPLAIN ANALYZE also execute the query or just return the execution plan? It's been running for 10 minutes. work_mem is default and set to 1MB. A while ago I found a site that generates a postgresql.conf file based on the system specifications but I can't find it anymore. I've got 16GB memory. edit: it obviously executes when using the ANALYZE parameter edit2: got it - http://pgtune.leopard.in.ua/ – Juergen Dec 01 '14 at 14:06
  • 1
    It's normal that an UPDATE is much slower than creating a new table with brand new content. An UPDATE locks and relocates each row individually and has to update each index entry too. In your case with 3 indexes I'd bet on it being 5-10x slower. – Daniel Vérité Dec 01 '14 at 14:16
  • @Daniel: That's what I suspected. I will try it with optimized server configuration, though. Is there a reference or do you know the reason, why the indexes have to be updated? The new column is not indexed and I thought the existing indexes remain untouched, as I don't delete/insert. – Juergen Dec 01 '14 at 14:22
  • 2
    The index points to the row and the row is relocated due to MVCC (=the need for the old values to be kept around). See http://stackoverflow.com/questions/3361291 or http://dba.stackexchange.com/questions/52517 for more but basically you'll see that your method #2 is optimal. – Daniel Vérité Dec 01 '14 at 14:50
  • Thanks @DanielVérité. That's the answer I'll go with. After optimizing the server config and first query, the statement still wasn't finished after 30 minutes. The references are very helpful! – Juergen Dec 01 '14 at 15:10
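The work_mem tuning suggested in the comments can be applied per session rather than server-wide, so it only affects the one big UPDATE. The value below is an example, not a recommendation for this specific machine:

    -- Raise work_mem for the current session only (value is illustrative):
    SET work_mem = '256MB';

    -- Then run the update in the same session:
    UPDATE tx
    SET txout_id = txout.txout_id
    FROM txout
    WHERE tx.id = txout.tx_id;

A session-level SET reverts automatically when the connection closes, so the server default is left untouched.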

1 Answer


Try this query:

    UPDATE tx
    SET txout_id = txout.txout_id
    FROM txout
    WHERE tx.id = txout.tx_id

You do not need to add the original table to the FROM clause, unless you intend an additional self-join.
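To confirm that the rewritten statement no longer produces a self-join, the plan can be inspected without executing the update:

    -- EXPLAIN (without ANALYZE) only returns the plan, it does not run the UPDATE.
    EXPLAIN
    UPDATE tx
    SET txout_id = txout.txout_id
    FROM txout
    WHERE tx.id = txout.tx_id;

The plan should show a single join between tx and txout; the original query's plan would show tx appearing twice.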

Ihor Romanchenko
  • 1
    That's an answer that helps me optimize the query. It doesn't solve the problem of the UPDATE query being much slower than the SELECT INTO query, though. – Juergen Dec 01 '14 at 14:50