We're trying to execute a batch insert into Azure Synapse (formerly Azure SQL Data Warehouse). The problems are:
- Performance is abysmal: ~1 second to insert a single row of less than 2 KB spread across 20-25 columns.
- It scales linearly: ~90 seconds for 100 rows, if I recall correctly.
We're using the standard JDBC batch insertion pattern of addBatch() and executeBatch() with PreparedStatements (https://stackoverflow.com/a/3786127/496289), along with the JDBC driver provided by Microsoft.
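For reference, a minimal sketch of what we're doing (server, credentials, table, and column names are placeholders, and the insert is simplified to two columns; the real table has 20-25):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class SynapseBatchInsert {
    public static void main(String[] args) throws SQLException {
        // Placeholder connection details; the real server/database differ.
        String url = "jdbc:sqlserver://<server>.sql.azuresynapse.net:1433;"
                + "database=<db>;user=<user>;password=<password>";
        try (Connection conn = DriverManager.getConnection(url);
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO dbo.my_table (col1, col2) VALUES (?, ?)")) {
            conn.setAutoCommit(false); // one commit for the whole batch
            for (int i = 0; i < 100; i++) {
                ps.setInt(1, i);
                ps.setString(2, "value-" + i);
                ps.addBatch(); // queue the row client-side
            }
            ps.executeBatch(); // submit all 100 rows in one call
            conn.commit();
        }
    }
}
```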
We know what's wrong: the DB telemetry makes it clear that the DB is breaking the batch apart and more or less running it as if it were in a for-loop, with no batch "optimization" applied.
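In other words, judging by the telemetry, the batch behaves roughly as if we had written the naive row-at-a-time version below (reusing the same ps as in the sketch above):

```java
// Roughly what the telemetry suggests is happening:
// each row processed individually, no batch optimization.
for (int i = 0; i < 100; i++) {
    ps.setInt(1, i);
    ps.setString(2, "value-" + i);
    ps.executeUpdate(); // one statement execution per row
}
```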
Curiously, when the underlying data source is SQL Server, the same batch scales as expected.
The question is: is there anything in the standard/spec that says executeBatch() should scale better than linearly? The JDBC™ 4.3 Specification (JSR 221), for example, says batching can improve performance, not that it must:
> CHAPTER 14 Batch Updates
>
> The batch update facility allows multiple SQL statements to be submitted to a data source for processing at once. Submitting multiple SQL statements, instead of individually, can greatly improve performance. Statement, PreparedStatement, and CallableStatement objects can be used to submit batch updates.
Section 14.1.4, PreparedStatement Objects, has no explicit or implied statement saying the batch mechanism is there for better performance either.
I should probably add that Azure Synapse is capable of loading 1 trillion rows of data (~450 GB in Parquet format) from the Data Lake in 17-26 minutes with 500 DWUs, so the service itself is clearly capable of very fast ingestion.
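For contrast, that bulk path is a single statement. A sketch of issuing it over the same JDBC connection (the table name and storage URL are hypothetical, and COPY INTO may additionally need a CREDENTIAL clause depending on your auth setup):

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class CopyIntoExample {
    // Hypothetical illustration of the bulk-load path mentioned above,
    // using Synapse's COPY INTO statement; names and URL are placeholders.
    static void bulkLoadFromDataLake(Connection conn) throws SQLException {
        String copySql =
            "COPY INTO dbo.my_table "
            + "FROM 'https://<account>.blob.core.windows.net/<container>/data/*.parquet' "
            + "WITH (FILE_TYPE = 'PARQUET')"; // may need CREDENTIAL = (...) as well
        try (Statement stmt = conn.createStatement()) {
            stmt.execute(copySql); // one statement; the server parallelizes the load
        }
    }
}
```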