
We're trying to execute a batch insert into Azure Synapse (formerly Azure SQL Data Warehouse). The problems are:

  • Performance is abysmal: ~1 second to insert a single row of less than 2 KB spread over 20-25 columns.
  • It scales linearly (~90 seconds for 100 rows, I think).

We're using the standard JDBC batch insertion pattern, addBatch() & executeBatch() with PreparedStatements (https://stackoverflow.com/a/3786127/496289).
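For reference, this is roughly the pattern in use; a minimal sketch, where the connection URL, credentials, table, and column names are placeholders rather than the real schema:

```java
// Minimal sketch of the addBatch()/executeBatch() pattern described above.
// Connection URL, credentials, table, and columns are placeholders, not the real schema.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchInsertExample {

    public static void insertRows(List<Object[]> rows) throws SQLException {
        String url = "jdbc:sqlserver://<server>.sql.azuresynapse.net:1433;database=<db>"; // placeholder
        String sql = "INSERT INTO dbo.some_table (col1, col2, col3) VALUES (?, ?, ?)";    // placeholder

        try (Connection conn = DriverManager.getConnection(url, "<user>", "<password>");
             PreparedStatement ps = conn.prepareStatement(sql)) {

            conn.setAutoCommit(false); // one commit for the whole batch, not one per row

            for (Object[] row : rows) {
                ps.setObject(1, row[0]);
                ps.setObject(2, row[1]);
                ps.setObject(3, row[2]);
                ps.addBatch();         // queue the row on the client side
            }

            ps.executeBatch();         // submit all queued rows in a single call
            conn.commit();
        }
    }
}
```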

We're using the JDBC driver provided by Microsoft.

We know what's wrong: the DB telemetry makes it clear that the DB is breaking the batch down and more or less running it as if it were in a for-loop. There is no batch "optimization".

Curiously, when the underlying data source is SQL Server, batching scales as expected.

The question is: is there nothing in the standard/spec that says executeBatch() should scale better than linearly?

E.g. the JDBC™ 4.3 Specification (JSR 221) says it can improve performance, not that it must:

CHAPTER 14 Batch Updates

The batch update facility allows multiple SQL statements to be submitted to a data source for processing at once. Submitting multiple SQL statements, instead of individually, can greatly improve performance. Statement, PreparedStatement, and CallableStatement objects can be used to submit batch updates

Section 14.1.4, PreparedStatement Objects, has no such explicit or implied statement saying that the batch mechanism is for better performance.


I should probably add that Azure Synapse is capable of loading 1 trillion rows of data (~450 GB in Parquet format) from the Data Lake in 17-26 minutes with 500 DWUs.
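For comparison, that bulk-load path goes through a set-based statement (e.g. COPY INTO) rather than row-by-row inserts. A rough sketch only; the table name, storage URL, and authentication options are placeholders and depend on the environment:

```java
// Rough sketch of the bulk-load path, shown only to contrast it with row-by-row inserts.
// Table name and storage URL are placeholders; a real COPY usually also needs a
// CREDENTIAL clause or a managed identity with access to the storage account.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class BulkLoadSketch {

    public static void copyFromDataLake(String url, String user, String password) throws SQLException {
        String copySql =
            "COPY INTO dbo.some_table " +
            "FROM 'https://<storage-account>.dfs.core.windows.net/<container>/<path>/*.parquet' " +
            "WITH (FILE_TYPE = 'PARQUET')";

        try (Connection conn = DriverManager.getConnection(url, user, password);
             Statement stmt = conn.createStatement()) {
            stmt.execute(copySql); // the load itself runs server-side, in parallel
        }
    }
}
```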

Kashyap
  • Well, you did not tell us what performance you observe *without* the batch (using a single PreparedStatement per INSERT). The main difference with the batch is the saving of a *round trip* for each inserted row. The good news, IMHO, is that even if you observe *repeated single-row inserts in the DB*, this cannot explain the *one second elapsed time per row*. So there must be another (probably trivial) explanation. – Marmite Bomber May 05 '20 at 14:40
  • @MarmiteBomber, without the batch it is ~1 sec/row. -- `So there must be another (probably trivial) explanation`: Not quite; we've had extensive tests/debugging and Azure has raised its hands saying it's expected. So there are no trivial or non-trivial explanations, it's just the performance. Adding another statement regarding performance at the bottom of the question. – Kashyap May 05 '20 at 15:30

1 Answer


The JDBC specification doesn't require any kind of optimization for batch execution. In fact, not all databases support batch execution. A conforming JDBC driver is expected to implement batch execution whether or not the underlying database system supports it.

If the database system doesn't support it, the JDBC driver will simulate batch execution by repeatedly executing the statement in a loop. Such an implementation will not perform better than manually executing the statement repeatedly.
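Conceptually, an emulated batch boils down to something like the following loop (illustrative only, not the actual driver code):

```java
// Illustrative only: what an emulated batch amounts to conceptually.
// Each queued row is re-bound and executed as its own statement, one round trip at a time.
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

class EmulatedBatch {

    static int[] executeBatchFallback(PreparedStatement ps, List<Object[]> queuedRows) throws SQLException {
        int[] updateCounts = new int[queuedRows.size()];
        for (int i = 0; i < queuedRows.size(); i++) {
            Object[] row = queuedRows.get(i);
            for (int p = 0; p < row.length; p++) {
                ps.setObject(p + 1, row[p]); // re-bind parameters for this row
            }
            updateCounts[i] = ps.executeUpdate(); // one execution per row
        }
        return updateCounts;
    }
}
```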

This is also why the text you quote says it "can greatly improve performance", and not "will" or "must".

Mark Rotteveel