0

I have an application that needs to process a "big" amount of data : first step is to load temporarily data into a table from a CSV file (in the described case about 1.7 million rows) and second one is to process updates / inserts on another table based on loaded data through one query :

UPDATE destination_table d
SET ...
FROM temp_table t
WHERE d.column = t.column;

Note : temp_table is emptied after the process is entirely done

We had some performance issues as there were no index on t.column (postgres process was using 100% of the CPU and the query could be stuck for several hours and even days), we added it and query was "only" running in around 5 mins on this specific in our environment which is more than acceptable.

The problem I'm facing is that in the client environments (same RAM / vCPU number), the index doesn't seem to be used. In one of those, the application seemed to work but we are now facing the same problem (query stuck for days with a process using 100% of the CPU) on the same file.

After some checking, I noticed that the index was not scanned in their environment

In ours pg_stat_all_indexes is telling me for this index that :

  • idx_scan : 2060888
  • idx_tup_read : 3762435
  • idx_tup_fetch : 3762432

when in their environment :

  • idx_scan : 6
  • idx_tup_read : 6
  • idx_tup_fetch : 6

I've talked to someone about it and he told me the reason could be that the table has not been vacuumed after the inserts. A process has been launched on 23/03 in their environment (still "running"...) and when i'm looking at the last_autovacuum on the temp_table, it has been done on 20/02 (not 03), only once obviously as the autovacuum_count is "1". He told me that we should "force" the vacuum in the exploitation process but as the inserts in the table and the update are supposed to be consecutive, I don't see how it can work.

Questions :

  • Is the vacuum (analyze?) mandatory every time I load data in my temp_table so that my index will be correctly used while performing the following update?
  • What could explain the index scan difference between environments?
alushiva
  • 11
  • 1
  • Should be more focused – MMG Mar 27 '20 at 10:09
  • What do you mean ? – alushiva Mar 27 '20 at 10:29
  • You should say what you want clearly – MMG Mar 27 '20 at 10:30
  • Edited, does it seem clearer ? – alushiva Mar 27 '20 at 10:39
  • Clearly ask one thing, for example see here: https://stackoverflow.com/questions/60462748/not-working-onbackpressed-when-setcancelable-of-alertdialog-is-false – MMG Mar 27 '20 at 10:41
  • Well i clearly asked two questions, I could have just written : "I have a 2 million row table with an index on it that's not used when i perform an update using that column, why ?" to be asked to give more details afterwards... – alushiva Mar 27 '20 at 10:46
  • OK, I retracted my flag – MMG Mar 27 '20 at 10:50
  • 1
    Yes, an `analyze` is necessary after bulk changing/loading a table (temp or not) if you want accurate statistics immediately after the bulk change –  Mar 27 '20 at 11:14
  • @a_horse_with_no_name Should I therefore run a vacuum analyze right after the loading step ? The colleague that informed me about vacuum also told me that the query could possibly not succeed and that it shouldn't be triggered by my backend application. – alushiva Mar 27 '20 at 11:22

1 Answers1

0

when i'm looking at the last_autovacuum on the temp_table, it has been done on 20/02 (not 03), only once obviously as the autovacuum_count is "1".

I didn't think temp tables were ever autovacuumed. Is this an actual TEMP table, or a permanent table you manually remove rows from?

He told me that we should "force" the vacuum in the exploitation process but as the inserts in the table and the update are supposed to be consecutive, I don't see how it can work.

I don't understand the problem. Insert a VACUUM ANALYZE temp_table statement between the INSERT and the UPDATE statements. (Or maybe just an ANALYZE).

Is the vacuum (analyze?) mandatory every time I load data in my temp_table so that my index will be correctly used while performing the following update?

Yes, it is mandatory to do the ANALYZE if you want the query to be planned using the correct statistics. This doesn't guarantee it will use the plan you want it to, of course.

What could explain the index scan difference between environments?

Probably the above, but seeing the output of 'EXPLAIN (ANALYZE, BUFFERS)` when it is not using the index could certainly help nail it down.

jjanes
  • 37,812
  • 5
  • 27
  • 34
  • 1. To be clear : It's not a real temporary table but a "normal" table where data are stored temporarily 2. I don't know either, it's why I asked, his sentence was "Don't run it (vacuum analyse) from the code itself. It could never end : that's more an exploitation procedure" 3. Ok, so I guess I'm gonna try to do it anyway 4. Doing the explain can be a bit complicated as there are actually multiple queries in one transcation launched in the so called "2nd part" of my process but it's always this specific update (the one related to the index usage) that is getting stuck. – alushiva Mar 27 '20 at 14:12