I am attempting to apply 125 different updates to a large database containing 6 tables, each holding between 100k and 300 million records.
Each update contains new data to be inserted into the original 6 tables, but an update can also contain data that is the next version of a record that already exists in the original table. In that case I need to update a field on the original record with the update load number. Both the update data and the original data contain a unique id, a 20-character varchar, which has a standard B-tree index on both the original and update tables.
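For reference, the indexes are plain B-tree indexes on that column, roughly like this (table names match the queries below, one of the 6 original/update table pairs shown):
CREATE INDEX ON original_table (unique_id);
CREATE INDEX ON update_table (unique_id);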
An example of an original record is:
unique_id, version, version_date, change_dates
"tlzb1000001554000601";7;"2003-12-22";"{1995-12-04,1995-12-14,2002-06-21,2002-06-25,2003-12-16}"
And the corresponding update record would be:
unique_id, version, version_date, change_dates
"tlzb1000001554000601";8;"2004-08-10";"{1995-12-04,1995-12-14,2002-06-21,2002-06-25,2003-12-16,2004-07-27}"
As I need to track which update number impacted each record, I have added a load_number column to the original data tables, which I update whenever an incoming record has a matching unique_id.
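The column itself is just an integer, added along these lines:
ALTER TABLE original_table ADD COLUMN load_number integer;  -- stays null until an update load touches the record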
So for each update I load its data into a set of 6 staging tables that match the schema of my original data, and then apply the update so that any record being superseded has its integer load_number field set to the number of the update I am currently processing.
UPDATE original_table
SET load_number = ${update_number}
WHERE unique_id IN (SELECT unique_id FROM update_table)
This did not perform well, often taking over 10 hours per update. After some research I found this advice, so I changed the query to use a CTE and UPDATE ... FROM:
WITH new AS (
    SELECT unique_id
    FROM update_table
)
UPDATE original_table o
SET load_number = ${update_number}
FROM new n
WHERE o.unique_id = n.unique_id
Using the above queries I have managed to do 32 updates in a week, running 24/7. I have tried to speed things up by temporarily turning off autovacuum for the tables.
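In case it matters, the per-table toggle was along these lines (repeated for each of the 6 tables, then reversed afterwards):
ALTER TABLE original_table SET (autovacuum_enabled = false);  -- before the update run
ALTER TABLE original_table SET (autovacuum_enabled = true);   -- after the update run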
I have also tried deleting the affected rows into a holding table and then inserting them back with the updated field:
WITH new AS (
    SELECT unique_id FROM update_table
), tmp AS (
    DELETE FROM original_table b
    USING new n
    WHERE b.unique_id = n.unique_id
    RETURNING *
)
INSERT INTO old_data SELECT * FROM tmp
However, this seems to take about 4x as long.
I have now exhausted all the variations I can think of, so I am after some alternatives to try.
One possible option I have thought of, but am not sure how to implement, would be to load all 125 update loads into the 6 update tables with the load_number field set to the corresponding update number, and only then use those tables to modify the original tables in one pass. What I am unsure about is how to apply the updates in the correct order so that each record ends up with the load_number of the latest update that touched it.
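Something along these lines is what I am picturing, though I am not sure it is right (this assumes the staged data already carries load_number and that the highest load_number per unique_id is the one that should win):
WITH latest AS (
    SELECT DISTINCT ON (unique_id)
           unique_id, load_number
    FROM   update_table
    ORDER  BY unique_id, load_number DESC   -- keep only the newest load per id
)
UPDATE original_table o
SET    load_number = l.load_number
FROM   latest l
WHERE  o.unique_id = l.unique_id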
Hopefully someone has a solution. Thanks in advance.
Extra info: I have a PostgreSQL 9.6 database on a 64-bit Windows server with 20 cores and 128 GB of RAM. I have tuned the database based on the wiki tuning advice.