I have an application that needs to process a "big" amount of data : first step is to load temporarily data into a table from a CSV file (in the described case about 1.7 million rows) and second one is to process updates / inserts on another table based on loaded data through one query :
UPDATE destination_table d
SET ...
FROM temp_table t
WHERE d.column = t.column;
Note : temp_table is emptied after the process is entirely done
We had some performance issues as there were no index on t.column (postgres process was using 100% of the CPU and the query could be stuck for several hours and even days), we added it and query was "only" running in around 5 mins on this specific in our environment which is more than acceptable.
The problem I'm facing is that in the client environments (same RAM / vCPU number), the index doesn't seem to be used. In one of those, the application seemed to work but we are now facing the same problem (query stuck for days with a process using 100% of the CPU) on the same file.
After some checking, I noticed that the index was not scanned in their environment
In ours pg_stat_all_indexes is telling me for this index that :
- idx_scan : 2060888
- idx_tup_read : 3762435
- idx_tup_fetch : 3762432
when in their environment :
- idx_scan : 6
- idx_tup_read : 6
- idx_tup_fetch : 6
I've talked to someone about it and he told me the reason could be that the table has not been vacuumed after the inserts. A process has been launched on 23/03 in their environment (still "running"...) and when i'm looking at the last_autovacuum on the temp_table, it has been done on 20/02 (not 03), only once obviously as the autovacuum_count is "1". He told me that we should "force" the vacuum in the exploitation process but as the inserts in the table and the update are supposed to be consecutive, I don't see how it can work.
Questions :
- Is the vacuum (analyze?) mandatory every time I load data in my temp_table so that my index will be correctly used while performing the following update?
- What could explain the index scan difference between environments?