
I'm currently analyzing the Bitcoin blockchain. I have the following tables in my PostgreSQL database:

  • tx: 11GB
  • txin: 40GB
  • txout: 37GB

To speed up my analysis queries, I copy columns between tables. There are indexes on tx.id, txout.tx_id, and txout.txout_id. To copy the column txout_id from table txout to table tx, I executed the following queries:

First query:

    UPDATE tx
    SET txout_id = txout.txout_id
    FROM tx as t1
    INNER JOIN txout
    ON t1.id = txout.tx_id

Second query:

    SELECT tx.*, txout.txout_id 
    INTO tx_txoutID
    FROM tx
    INNER JOIN txout
    ON tx.id = txout.tx_id

I cancelled the first query after 75 minutes. The second query finished after 20 minutes. The second query requires that I drop tx, rename tx_txoutID, and recreate the indexes on tx afterwards. Now I'm wondering whether there's a query that's as fast as the second and as convenient as the first.
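For reference, the table-swap that the second approach requires can be sketched as follows. The index names below are illustrative placeholders, not taken from the actual schema:

    -- Replace tx with the freshly built table, then recreate the indexes.
    -- Index names are hypothetical.
    BEGIN;
    DROP TABLE tx;
    ALTER TABLE tx_txoutID RENAME TO tx;
    CREATE INDEX tx_id_idx ON tx (id);
    COMMIT;

Doing the swap inside a single transaction keeps the table available under its old name until the commit.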

Edit: txout_id is originally not a part of tx, but has been added for the first statement by altering the table.
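The column addition mentioned in the edit would look something like this; the bigint type is an assumption chosen to match a typical txout.txout_id column:

    -- Add the target column before running the UPDATE (type is assumed).
    ALTER TABLE tx ADD COLUMN txout_id bigint;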

Juergen
    Why do you use tx twice? UPDATE tx SET txout_id = txout.txout_id FROM txout WHERE t1.id = txout.tx_id; – Frank Heikens Dec 01 '14 at 12:08
  • I'd suspect that one of the tables doesn't have an applicable index. – Clockwork-Muse Dec 01 '14 at 12:21
  • 2
    This is a common misconception, as @FrankHeikens pointed out: if you mention the currently updated table in the `WHERE` clause, you create a (mostly unintentional) self-join: http://www.postgresql.org/docs/current/static/sql-update.html#AEN81996 (see parameter at `from_list`) – pozs Dec 01 '14 at 12:34
  • @Frank Heikens: Huh, it never occurred to me that you don't need a JOIN when updating from another table. I just tested it and the query has been running for 35 minutes. I believe that eventually this is faster than my first query but still way slower than the SELECT INTO JOIN query. – Juergen Dec 01 '14 at 13:26
  • 1
    Could you show us the results from EXPLAIN ANALYZE? And what is the setting for work_mem? Tuning work_mem could help in this large update, same thing for the checkpoint parameters like checkpoint_segments. – Frank Heikens Dec 01 '14 at 13:38
  • I cancelled the query after 65 minutes. Does EXPLAIN ANALYZE also execute the query or just return the execution plan? It's been running for 10 minutes. work_mem is default and set to 1MB. A while ago I found a site that generates a postgresql.conf file based on the system specifications but I can't find it anymore. I've got 16GB memory. edit: it obviously executes when using the ANALYZE parameter edit2: got it - http://pgtune.leopard.in.ua/ – Juergen Dec 01 '14 at 14:06
  • 1
    It's normal that an UPDATE is much slower than creating a new table with brand new content. An UPDATE locks and relocates each row individually and has to update each index entry too. In your case with 3 indexes I'd bet on it being 5-10x slower. – Daniel Vérité Dec 01 '14 at 14:16
  • @Daniel: That's what I suspected. I will try it with optimized server configuration, though. Is there a reference or do you know the reason, why the indexes have to be updated? The new column is not indexed and I thought the existing indexes remain untouched, as I don't delete/insert. – Juergen Dec 01 '14 at 14:22
  • 2
    The index points to the row and the row is relocated due to MVCC (=the need for the old values to be kept around). See http://stackoverflow.com/questions/3361291 or http://dba.stackexchange.com/questions/52517 for more but basically you'll see that your method #2 is optimal. – Daniel Vérité Dec 01 '14 at 14:50
  • Thanks @DanielVérité. That's the answer I'll go with. After optimizing the server config and first query, the statement still wasn't finished after 30 minutes. The references are very helpful! – Juergen Dec 01 '14 at 15:10
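The work_mem tuning suggested in the comments can be applied per session rather than server-wide, so it only affects the one big UPDATE. The value below is an example, not a recommendation for this specific machine:

    -- Raise work_mem for the current session only (value is illustrative):
    SET work_mem = '256MB';

    -- Then run the update in the same session:
    UPDATE tx
    SET txout_id = txout.txout_id
    FROM txout
    WHERE tx.id = txout.tx_id;

A session-level SET reverts automatically when the connection closes, so the server default is left untouched.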

1 Answer


Try this query:

    UPDATE tx
    SET txout_id = txout.txout_id
    FROM txout
    WHERE tx.id = txout.tx_id

You do not need to add the original table to the FROM clause, unless you intend an additional self-join.
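To confirm that the rewritten statement no longer produces a self-join, the plan can be inspected without executing the update:

    -- EXPLAIN (without ANALYZE) only returns the plan, it does not run the UPDATE.
    EXPLAIN
    UPDATE tx
    SET txout_id = txout.txout_id
    FROM txout
    WHERE tx.id = txout.tx_id;

The plan should show a single join between tx and txout; the original query's plan would show tx appearing twice.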

Ihor Romanchenko
  • 1
    That's an answer that helps me optimize the query. It doesn't solve the problem of the UPDATE query being much slower than the SELECT INTO query, though. – Juergen Dec 01 '14 at 14:50