I'm working on a product that involves a large number of upsert operations into a single table.

We are dealing with time-based data and using TimescaleDB hypertables with a 7-day chunk interval. We have concurrent tasks that upsert data into a single table, and in extreme cases it's possible that we will have 40 concurrent tasks, each upserting around 250k rows, all to the same table.

Initially we decided to go with the approach of deleting all the old rows and then inserting the updated ones with a COPY FROM statement, but when we tested the system at scale these COPYs took a long time to finish, eventually driving the DB's CPU usage to 100% and making it unresponsive. We also noticed that the table's index size grew dramatically, filling the disk to 100%, and SELECT statements took extremely long to execute (over 10 minutes). We concluded that the cause was the large number of DELETE statements, which led to index fragmentation, so we decided to try another approach.
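
For reference, the original delete-then-COPY flow was roughly the following (simplified sketch; "metrics" and "ts" are made-up names for illustration):

BEGIN;
-- delete the rows for the affected date range, then bulk-load the updated ones
DELETE FROM metrics WHERE ts >= '2022-02-01' AND ts < '2022-02-08';
COPY metrics FROM STDIN WITH (FORMAT csv);
COMMIT;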

Following the answers on this post, we decided to copy all the data into a temporary table and then upsert it into the actual table using a single INSERT ... ON CONFLICT ("extended insert") statement -

INSERT INTO table SELECT * FROM temp_table ON CONFLICT (...) DO UPDATE SET ...;
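
In full, the flow looks roughly like this (sketch only; "metrics", "device_id", "ts" and "value" are made-up names, and it assumes a unique index on (device_id, ts)):

BEGIN;
-- temp table matches the target's columns; dropped automatically at commit
CREATE TEMP TABLE temp_table (LIKE metrics INCLUDING DEFAULTS) ON COMMIT DROP;
COPY temp_table FROM STDIN WITH (FORMAT csv);
-- single upsert from the temp table into the hypertable
INSERT INTO metrics
SELECT * FROM temp_table
ON CONFLICT (device_id, ts) DO UPDATE
    SET value = EXCLUDED.value;
COMMIT;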

Our tests show that this helped with the index fragmentation issues, but large upsert operations of ~250K rows still take over 4 minutes to execute, and while the upsert is running SELECT statements take too long to finish, which is unacceptable for us.

I'm wondering what the best approach is to perform this upsert operation with as little impact on SELECT performance as possible. The only thing that comes to mind right now is to split the insert into smaller chunks -

INSERT INTO table SELECT * FROM temp_table LIMIT 50000 OFFSET 0 ON CONFLICT (...) DO UPDATE SET ...;
INSERT INTO table SELECT * FROM temp_table LIMIT 50000 OFFSET 50000 ON CONFLICT (...) DO UPDATE SET ...;
INSERT INTO table SELECT * FROM temp_table LIMIT 50000 OFFSET 100000 ON CONFLICT (...) DO UPDATE SET ...;
...

But if we batch the inserts, is there any advantage to first copying all the data into a temporary table? Will it perform better than a simple multi-row INSERT statement? How do I decide what the best chunk size is when splitting up the upsert? And does using a temporary table and upserting the rows directly from it allow for bigger chunk sizes?
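
(By "a simple multi-row insert" I mean something like the following; names and values are made up:)

INSERT INTO metrics (device_id, ts, value)
VALUES
    (1, '2022-02-01 00:00:00+00', 10.5),
    (1, '2022-02-01 00:01:00+00', 11.2),
    (2, '2022-02-01 00:00:00+00',  9.8)
ON CONFLICT (device_id, ts) DO UPDATE
    SET value = EXCLUDED.value;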

Is there any better approach to achieve this? Any suggestion would be appreciated.

Ben Ben Sasson
  • A single thread that does inserts already puts enough load on your system that your queries are impacted? Perhaps you need a stronger system. – Laurenz Albe Feb 08 '22 at 15:41
  • no single thread, 40 different concurrent tasks (processes), each inserting 250k rows at a time – Ben Ben Sasson Feb 08 '22 at 15:52
  • Well, then the solution is simple: use fewer threads, if you want to throttle the load. – Laurenz Albe Feb 08 '22 at 16:00
  • I don't want to throttle the load :) I want to optimize it and find the best way to insert as many rows as possible – Ben Ben Sasson Feb 08 '22 at 16:17
  • The best way to optimize your inserts is to tell your selects to go jump in a lake. You get to decide which one is less objectionable. – jjanes Feb 08 '22 at 18:48
  • Why were you deleting rather than truncating? Were you deleting the entire table or just one partition or by some other criterion that doesn't align with partitions? – jjanes Feb 08 '22 at 19:09
  • Deleting by another criterion, I'm afraid: deleting by a date, and since date "partitioning" is handled by TimescaleDB's hypertables, it cannot be partitioned further – Ben Ben Sasson Feb 08 '22 at 19:20

1 Answer


There are a handful of things that you can do to speed up data loading:

  • Have no index or foreign key on the table while you load data (check constraints are fine). I am not suggesting that you drop all your constraints, but you could, for example, use partitioning: load the data into a new table, then create indexes and constraints and attach the table as a new partition (see the sketch after this list).

  • Load the data with COPY. If you cannot use COPY, use a prepared statement for the INSERT to save on parsing time.

  • Load many rows in a single transaction.

  • Set max_wal_size high so that you get no more checkpoints than necessary.

  • Get fast local disks.
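
For illustration, here is a minimal sketch of the partitioning idea for a natively partitioned table, together with the max_wal_size setting (TimescaleDB hypertables manage their own chunks, so adapt accordingly; all names and values here are made up):

-- load into a standalone table that has no indexes or foreign keys yet
CREATE TABLE metrics_2022_02_07 (LIKE metrics INCLUDING DEFAULTS);
COPY metrics_2022_02_07 FROM '/tmp/batch.csv' WITH (FORMAT csv);

-- build indexes and constraints only after the data is loaded
CREATE INDEX ON metrics_2022_02_07 (device_id, ts);
ALTER TABLE metrics_2022_02_07
    ADD CONSTRAINT metrics_2022_02_07_ts_check
    CHECK (ts >= '2022-02-07' AND ts < '2022-02-14');

-- attach the loaded table as a new partition
ALTER TABLE metrics ATTACH PARTITION metrics_2022_02_07
    FOR VALUES FROM ('2022-02-07') TO ('2022-02-14');

-- separately: fewer checkpoints during bulk loads
ALTER SYSTEM SET max_wal_size = '16GB';  -- value is an example, not a recommendation
SELECT pg_reload_conf();

The CHECK constraint that matches the partition bounds allows ATTACH PARTITION to skip the full validation scan, so the attach itself is quick.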

Laurenz Albe