
I have to update multiple records in a table in the most efficient way possible, with the least latency and without heavy CPU utilisation. The number of records to update at a time can range from 1 to 1000.

We do not want to lock the database when this update occurs as other services are utilising it.

Note: there are no dependencies from this table to any other table in the system.

After looking in many places, I've narrowed it down to a few ways to do the task:

  1. simple-update: a simple UPDATE query against the table, using the already known ids
    1. Either multiple UPDATE queries (one query for each individual record), or
    2. Usage of an update ... from clause as mentioned here as a single query (one query for all records); see the sketch after this list
  2. delete-then-insert: first delete the outdated data, then insert the updated data with new ids (since there is no dependency on the records, new ids are acceptable)
  3. insert-then-delete: first insert the updated records with new ids, then delete the outdated data using the old ids (since there is no dependency on the records, new ids are acceptable)
  4. temp-table: first insert the updated records into a temporary table, then update the original table with the records from the temporary table, and finally remove the temporary table.
    1. We must not drop the existing table and create a new one in its place
    2. We must not truncate the existing table, because we have a huge number of records that we cannot hold in buffer memory
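
For reference, a minimal sketch of what option 1.2 could look like, assuming a hypothetical table my_table(id, col1, col2) (the real table and columns will differ); the client builds the VALUES list from the known ids and their new values:

```sql
-- Hypothetical table my_table(id bigint, col1 text, col2 int).
-- One statement updates every affected row; the VALUES list carries
-- the (id, new values) pairs built by the client.
UPDATE my_table AS t
SET    col1 = v.col1,
       col2 = v.col2
FROM (VALUES
        (1, 'foo', 10),
        (2, 'bar', 20),
        (3, 'baz', 30)
     ) AS v(id, col1, col2)
WHERE  t.id = v.id;
```

If PostgreSQL cannot infer the right column types from the literals, explicit casts on the first VALUES row (e.g. 1::bigint, 'foo'::text) resolve it.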

I'm open to any more suggestions.

Also, what will be the impact of making the update all at once vs doing it in batches of 100, 200 or 500?

  • Why do you think UPDATE is not efficient? And how do you measure the efficiency? – Frank Heikens May 30 '22 at 07:54
  • I never mentioned that `update` is not efficient, I simply want to know how it behaves in comparison with other ways of updating the records. I'm measuring efficiency in terms of CPU utilisation and time taken to update. – kartoon May 30 '22 at 07:57
  • You are the only one who can test the updates on your data on your server, while you also have your other server load. You can play with the FILLFACTOR (and other server settings), be selective with the usage of indexes (that also need to be updated), etc. UPDATE ... FROM ... would be my starting point, to keep things simple and standardised. – Frank Heikens May 30 '22 at 08:53
  • Only (1). (2) and (3) do too much work, and (4) is even beyond that. – wildplasser May 30 '22 at 11:13

1 Answer


As mentioned by @Frank Heikens in the comments, I'm sure that different people will see different statistics based on their system design. I did some checks and found some insights to share from one of my development systems.

Configurations of the system used:

  • AWS
  • Engine: PostgreSQL
  • Engine version: 12.8
  • Instance class: db.m6g.xlarge
  • Instance vCPU: 4
  • Instance RAM: 16GB
  • Storage: 1000 GiB

I used an AWS Lambda function and the pg package to write data into a table (default FILLFACTOR) that contains 3,409,304 records. Both the Lambda function and the database were in the same region.

UPDATE 1000 records in the database with a single query

Run Time taken
1 143.78ms
2 115.277ms
3 98.358ms
4 98.065ms
5 114.78ms
6 111.261ms
7 107.883ms
8 89.091ms
9 88.42ms
10 88.95ms

UPDATE 1000 records in the database, split into 2 concurrent batches of 500 records each (one query per batch)

Run Time taken
1 43.786ms
2 48.099ms
3 45.677ms
4 40.578ms
5 41.424ms
6 44.052ms
7 42.155ms
8 37.231ms
9 38.875ms
10 39.231ms

DELETE + INSERT 1000 records into the database

Run Time taken
1 230.961ms
2 153.159ms
3 157.534ms
4 151.055ms
5 132.865ms
6 153.485ms
7 131.588ms
8 135.99ms
9 287.143ms
10 175.562ms

I did not go on to test updating the records via another buffer table, because I had already found my answer.

I looked at the database metrics graphs provided by AWS, and from those it was clear that DELETE + INSERT was more CPU intensive. And from the statistics shared above, DELETE + INSERT also took more time than UPDATE.

If the updates are done concurrently in batches, then yes, they will be faster, depending on the number of connections (a connection pool is recommended).

Using a buffer table, truncate, and other such methods might be more suitable if you need to update almost all the records in a giant table, though I currently do not have metrics to support this. However, for a limited number of records, UPDATE is a fine choice to proceed with.
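
For completeness, a minimal, untested sketch of the buffer-table approach (option 4 in the question), again assuming the hypothetical my_table(id, col1, col2):

```sql
BEGIN;

-- Stage the new values in a temporary table; ON COMMIT DROP removes it
-- automatically, so no explicit DROP TABLE is needed afterwards.
CREATE TEMP TABLE my_table_updates (
    id   bigint PRIMARY KEY,
    col1 text,
    col2 int
) ON COMMIT DROP;

-- Bulk-load the new values here, e.g. with COPY or a multi-row INSERT.
INSERT INTO my_table_updates (id, col1, col2) VALUES
    (1, 'foo', 10),
    (2, 'bar', 20);

-- Apply the staged values to the original table in one pass.
UPDATE my_table AS t
SET    col1 = u.col1,
       col2 = u.col2
FROM   my_table_updates AS u
WHERE  t.id = u.id;

COMMIT;
```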

Also, be mindful that if DELETE + INSERT is not executed properly and fails partway, you might lose records, and if INSERT + DELETE fails partway, you might end up with duplicate records.
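
One way to guard against that (a minimal sketch, using the same hypothetical my_table as above) is to wrap both statements in a single transaction, so that either both take effect or neither does:

```sql
BEGIN;

-- Remove the outdated rows by their old ids.
DELETE FROM my_table
WHERE  id IN (1, 2, 3);

-- Insert the updated rows; new ids are acceptable per the question.
INSERT INTO my_table (id, col1, col2) VALUES
    (101, 'foo', 10),
    (102, 'bar', 20),
    (103, 'baz', 30);

COMMIT;  -- either both statements apply or neither does
```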
