
I am attempting to apply 125 different updates to a large database containing 6 tables, each of which holds between 100k and 300 million records.

Each update contains new data to be inserted into the original 6 tables, but it also contains data that is the next version of a record that already exists in the original table. In that case I need to update a field with the update load number. Both the update data and the original data contain a unique id, a 20-character varchar, which has a standard B-tree index on both the original and update tables.

An example of the original data is this:

unique_id, version, version_date, change_dates
"tlzb1000001554000601";7;"2003-12-22";"{1995-12-04,1995-12-14,2002-06-21,2002-06-25,2003-12-16}"

And an update record would be:

unique_id, version, version_date, change_dates
"tlzb1000001554000601";8;"2004-08-10";"{1995-12-04,1995-12-14,2002-06-21,2002-06-25,2003-12-16,2004-07-27}"
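
For reference, the shape of each original table (types inferred from the sample row above, so treat them as approximate) is something like:

-- Rough sketch of one original table; column types are inferred from the
-- sample record and may not match the real definitions exactly.
CREATE TABLE original_table (
    unique_id    varchar(20) NOT NULL,
    version      integer,
    version_date date,
    change_dates date[]
);

-- Standard B-tree index on the id used to match against the update tables.
CREATE INDEX original_table_unique_id_idx ON original_table (unique_id);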

As I need to track which update impacted each record, I have added a load_number column to the original data tables, which I hoped to update whenever there is a record with a matching unique_id.
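
Adding the tracking column amounts to something like this (a sketch; load_number is the column referenced in the queries below):

-- Add the tracking column to each of the 6 original tables.
ALTER TABLE original_table ADD COLUMN load_number integer;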

So for each update I load the data into a set of 6 tables that match the schema of my original data, and then apply the update so that any record being updated gets its load_number field set to the number of the update I am processing.

UPDATE original_table 
SET load_number = ${update_number} 
WHERE unique_id IN (SELECT unique_id FROM update_table)

This did not work well and often took over 10 hours per update. After some research I found this advice, and so changed my query to use a CTE and UPDATE ... FROM:

WITH new AS (
    SELECT unique_id 
    FROM update_table
) 
UPDATE original_table o 
SET load_number = ${update_number} 
FROM new n 
WHERE o.unique_id=n.unique_id
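
To see whether the rewritten form actually gets a better plan (ideally a single hash or merge join against the update table rather than row-by-row index probes), the plan can be checked with something like the following; the update number here is just a placeholder:

-- EXPLAIN ANALYZE executes the statement, so wrap it in a transaction
-- and roll back to avoid committing the test run.
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE original_table o
SET load_number = 42
FROM update_table n
WHERE o.unique_id = n.unique_id;
ROLLBACK;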

Using the above queries I have managed to do 32 updates in a week running 24/7. I have tried to speed things up by temporarily turning off autovacuum for the tables.
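
Turning autovacuum off per table was done with the storage parameter, roughly like this (it needs to be switched back on, and the tables vacuumed and analysed, once the batch work is finished):

-- Disable autovacuum on the large table while the bulk updates run.
ALTER TABLE original_table SET (autovacuum_enabled = false);

-- Re-enable it and clean up after the batch is done.
ALTER TABLE original_table SET (autovacuum_enabled = true);
VACUUM ANALYZE original_table;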

I have also tried deleting the matching rows into another table and then inserting them back with the updated field.

WITH new AS (
    SELECT unique_id FROM update_table
), tmp AS (
    DELETE FROM original_table b 
    USING new n 
    WHERE b.unique_id=n.unique_id 
    RETURNING *
)
INSERT INTO old_data SELECT * FROM tmp

However this seems to take 4x as long.

I have now exhausted all the variations I can think of, so I am after some alternatives to try.

One possible option I have thought of, but am not sure how to implement, would be to load all the update data into the 6 update tables with the load_number field set to the corresponding update number. Once all 125 updates are loaded, I would then use these tables to modify the original tables. However, I am not sure how I would then apply the updates in the correct order and set load_number to the correct value, as shown in the sketch below.
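
What I picture for that final pass, assuming a higher load_number always means a more recent update, is something along these lines, though I am not sure it is the right approach:

-- After loading all 125 updates into update_table (each row tagged with its
-- load_number), keep only the most recent update per unique_id and apply
-- them to the original table in a single pass.
UPDATE original_table o
SET load_number = u.load_number
FROM (
    SELECT DISTINCT ON (unique_id) unique_id, load_number
    FROM update_table
    ORDER BY unique_id, load_number DESC
) u
WHERE o.unique_id = u.unique_id;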

Hopefully someone has a solution. Thanks in advance.

Extra info: I have a PostgreSQL 9.6 database on a 64-bit Windows server with 20 cores and 128 GB of RAM. I have tuned the database based on the wiki tuning advice.


1 Answer


To me it seems you are trying to do something equivalent to:

INSERT INTO original_table 
SELECT * FROM update_table 
ON CONFLICT (unique_id) DO UPDATE SET 
    load_number = ${update_number}, 
    version = EXCLUDED.version,
    version_date = EXCLUDED.version_date, 
    change_dates = EXCLUDED.change_dates

PostgreSQL: Documentation: 9.6: INSERT#SQL-ON-CONFLICT
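
Note that ON CONFLICT (unique_id) only works if there is a unique index or constraint on unique_id; the plain B-tree index you describe is not enough, so one may need to be created first (the index name here is just an example, and it assumes unique_id really is unique in original_table):

-- ON CONFLICT needs a unique index/constraint to infer the conflict target.
CREATE UNIQUE INDEX CONCURRENTLY original_table_unique_id_key
    ON original_table (unique_id);

Also, SELECT * assumes update_table has exactly the same column list and order as original_table (including load_number); otherwise the columns should be listed explicitly in both the INSERT and the SELECT.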

kurkle