
I have Java code for inserting into Redshift like this:

String query = "INSERT INTO table (id, name, value) VALUES (?, ?, ?)";
PreparedStatement ps = connection.prepareStatement(query);            
for (Record record : records) {
    ps.setInt(1, record.id);
    ps.setString(2, record.name);
    ps.setInt(3, record.value);
    ps.addBatch();
}
ps.executeBatch();

records contains a few thousand items. I tried running this code against Postgres and it inserted all of them almost instantly, while with Redshift it takes 10+ minutes. After that I rewrote it to the following code:

String query = "INSERT INTO table (id, name, value) VALUES ";
for (Record record : records) {
    query += "(" + record.id + ",'" + record.name + "'," + record.value + "),";
}
query = query.substring(0, query.length() - 1); // drop the trailing comma
PreparedStatement ps = connection.prepareStatement(query);
ps.executeUpdate();

And this fixed the performance. This code works fine for both Redshift and Postgres. My question is: what's wrong with the first code snippet and how can I fix it? (I assume that the first snippet simply ignores batching on Redshift.)

John Rotenstein
pacman
  • Your second version is prone to SQL injection and for that reason should be avoided. The first version is the canonical way to do this AFAIK; I have no idea why it is so slow. – Tim Biegeleisen Dec 09 '22 at 11:10
  • @TimBiegeleisen I'm aware that the second snippet is unsafe, but somehow it works fast while the first takes ages - a few seconds vs 30 minutes. – pacman Dec 09 '22 at 11:29
  • Do you have auto-commit enabled? If yes, disable and try with an explicit transaction. – kdgregory Dec 09 '22 at 12:14
  • Other than that, single-row inserts in Redshift tend to have poor performance compared to Postgres, because the storage engines are different. When uploading to Redshift you typically store a file on S3 and use the COPY command -- although iirc there were some changes announced at the most recent re:invent that may have improved performance. – kdgregory Dec 09 '22 at 12:16
  • @kdgregory thanks for the comments, I do have auto-commit disabled. I have just found that adding `reWriteBatchedInserts=true;reWriteBatchedInsertsSize=1024` to the JDBC URL reduces write time from 30 minutes to 40 seconds, but in Postgres it is like 1-2 seconds. Regarding the S3 sink - I know about it, but for now I am limited to JDBC. – pacman Dec 09 '22 at 12:35
  • at least now I know that batching may be forced somehow in Redshift – pacman Dec 09 '22 at 12:35
  • I didn't know about those parameters. I recommend bumping the second one to 8192 or even larger. – kdgregory Dec 09 '22 at 12:53
  • @kdgregory already tried - it didn't improve insertion time a lot – pacman Dec 09 '22 at 12:58
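
For reference, here is roughly what the JDBC URL change mentioned in the comments can look like. This is only a sketch: the host, database, and credentials are placeholders, and whether these options are honored at all (and whether the driver wants `;` or `&` as the separator) depends on which JDBC driver you are using.

// Sketch of the URL tweak reported in the comments above. Host, database,
// user, and password are placeholders; the reWriteBatchedInserts* options
// are taken from the comment and may not be supported by every driver.
String url = "jdbc:redshift://my-cluster.example.us-east-1.redshift.amazonaws.com:5439/dev"
        + ";reWriteBatchedInserts=true;reWriteBatchedInsertsSize=1024";
try (Connection connection = DriverManager.getConnection(url, "my_user", "my_password")) {
    // run the batched PreparedStatement inserts from the question here
}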

1 Answer


Inserting single rows multiple times is never a good plan on a columnar database. Postgres is row-based and Redshift is column-based.

Each INSERT on Postgres just adds another row, but on Redshift each INSERT requires that the column data is read, one element is added, and the column is written back. Redshift doesn't rewrite the entire column, just the last 1MB block, but it has to read that block for each INSERT.

Also, since Redshift is clustered and the data for your table is distributed around the cluster, each subsequent INSERT accesses a different block on a different slice of the table. These INSERTs, each touching a single slice of the cluster, are serialized by the single-threaded nature of your code, so each access to a slice has to complete before the next INSERT can be issued.

Your second code puts many rows of data into a single INSERT statement, which is compiled once; the data is sent to all slices of the cluster, where each slice keeps only its own rows and discards the rest. This uses the parallelism of Redshift and only has to open the 1MB block on each slice once.

However, there are still performance and scalability issues with this approach (shared with approach #1, but not as bad). All the data is sent through the query compiler and then on to every slice, which slows down compile time and wastes network bandwidth. All the data also has to flow through the leader node, which is responsible for many database functions, and pushing large amounts of data through it can cause significant cluster-wide performance issues. The amount of data you can insert this way is limited by the maximum query length (16MB of SQL text). There is more, but I'll stop there. While this approach is better, from Redshift's point of view it is still far from ideal.
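
As an illustrative sketch (not part of the original answer), the multi-row VALUES idea can be combined with bound parameters so values are not concatenated into the SQL text, avoiding the injection issue raised in the comments. The chunk size, the assumption that records is a List<Record>, and the explicit commit are all assumptions added here.

// Sketch: multi-row INSERT ... VALUES with placeholders, sent in chunks.
// CHUNK_SIZE is an assumption; keep rows-per-statement * 3 parameters below
// the driver's bind-variable limit and the statement text below 16MB.
final int CHUNK_SIZE = 500;
for (int start = 0; start < records.size(); start += CHUNK_SIZE) {
    List<Record> chunk = records.subList(start, Math.min(start + CHUNK_SIZE, records.size()));

    StringBuilder sql = new StringBuilder("INSERT INTO table (id, name, value) VALUES ");
    for (int i = 0; i < chunk.size(); i++) {
        sql.append(i == 0 ? "(?, ?, ?)" : ", (?, ?, ?)");
    }

    try (PreparedStatement ps = connection.prepareStatement(sql.toString())) {
        int p = 1;
        for (Record record : chunk) {
            ps.setInt(p++, record.id);
            ps.setString(p++, record.name);
            ps.setInt(p++, record.value);
        }
        ps.executeUpdate();
    }
}
connection.commit(); // auto-commit is off per the comments above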

Bottom line - Postgres is a single-instance (scale-up), row-based, OLTP database designed for single-row inserts, while Redshift is a clustered (scale-out), column-based, OLAP database designed for parallel bulk loads. The COPY command has each compute node in Redshift connect to S3 and read component files of the input data. This gives you parallel action by the Redshift cluster, independent network access to S3, and parallel processing of the read data. (If you really want your code to run fast, make it multi-threaded, write your S3 files in parallel, and then issue a COPY command to Redshift.)
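
As an illustrative sketch (not part of the original answer) of the S3 + COPY route: it assumes the rows have already been written, ideally in parallel, as gzipped CSV files under an S3 prefix, and the bucket, prefix, and IAM role below are placeholders rather than anything from the question.

// Sketch: issue a COPY over the same JDBC connection once the files exist in S3.
// Bucket, prefix, and IAM role ARN are placeholders; the cluster must be able
// to assume that role to read from S3.
String copySql =
        "COPY table (id, name, value) " +
        "FROM 's3://my-bucket/loads/table/' " +
        "IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role' " +
        "FORMAT AS CSV GZIP";
try (Statement stmt = connection.createStatement()) {
    stmt.executeUpdate(copySql);
}
connection.commit();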

Bill Weiner
  • thanks for the great answer! I tried to use a multi-row insert with PreparedStatement similarly to this answer - https://stackoverflow.com/a/46813180 - it worked much faster, but I hit another limit - the limit on bind variables per statement - so I needed to chunk the SQL like this `insert ... into values (? ... ?).... (? ...?)`. But I'm not sure if formatting complex json and uploading it to s3 will bring lots of benefits for me. – pacman Dec 10 '22 at 03:53
  • How much data are you loading? MBs? GBs? The bigger the data load you are working with, the larger the benefit of heading down the S3 path. Also your data *seems* to be simple structured data (rows and columns) - why not use CSV? It is more data efficient than json. You say "complex json" so it may be that your real data is more complex than the simple examples you have posted. – Bill Weiner Dec 10 '22 at 05:32
  • @BillWeiner - asking as an interested bystander: does it make sense to use [CopyManager](https://jdbc.postgresql.org/documentation/publicapi/org/postgresql/copy/CopyManager.html) as a halfway measure between individual inserts and uploading to S3? You wouldn't get the performance benefits of parallel load, but for a small number of rows (1000s) I would hope that Redshift can internally buffer the entire operation (or at least a significant chunk of it). – kdgregory Dec 10 '22 at 10:18
  • Haven't used that class, so grain of salt time. It looks to be Postgres-specific and uses the "COPY FROM STDIN" SQL statement, which is not a thing in Redshift. If you want to bring data into Redshift via the JDBC connection you need to "insert ... values (...),(...),...". This isn't ideal for Redshift (see above). If you do need to do this for small amounts of data, combine at least 100 rows of data into each statement (and stay below the 16MB statement limit). Single-row inserts repeated millions of times are a Redshift anti-pattern; don't do it. – Bill Weiner Dec 10 '22 at 17:27