
I have a CSV file with 50,000 entries that I want to import into a SQL database using JDBC batch inserts.

What should be the optimal batch size for it?

Shivam Agrawal

2 Answers


According to Oracle's official recommendation, the optimal batch size is between 50 and 100.

Source: https://docs.oracle.com/cd/E11882_01/java.112/e16548/oraperf.htm#JJDBC28754

Oracle recommends that you use JDBC standard features when possible. This recommendation applies to update batching as well. Oracle update batching is retained primarily for backwards compatibility.

For both standard update batching and Oracle update batching, Oracle recommends you to keep the batch sizes in the general range of 50 to 100. This is because though the drivers support larger batches, they in turn result in a large memory footprint with no corresponding increase in performance. Very large batches usually result in a decline in performance compared to smaller batches.
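As a rough illustration, here is a minimal standard update batching sketch using a batch size in that range. The table name, columns, connection URL, and the naive comma split are all assumptions for the example; a real import would use a proper CSV parser that handles quoting.

```java
import java.io.BufferedReader;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class CsvBatchImport {
    public static void main(String[] args) throws Exception {
        final int BATCH_SIZE = 100; // within Oracle's recommended 50-100 range

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:oracle:thin:@//localhost:1521/XEPDB1", "user", "password"); // assumed URL
             PreparedStatement ps = conn.prepareStatement(
                 "INSERT INTO people (name, email) VALUES (?, ?)"); // assumed table/columns
             BufferedReader reader = Files.newBufferedReader(Paths.get("data.csv"))) {

            conn.setAutoCommit(false); // commit once at the end, not per row
            int count = 0;
            String line;
            while ((line = reader.readLine()) != null) {
                String[] fields = line.split(",", -1); // naive CSV parsing for the example
                ps.setString(1, fields[0]);
                ps.setString(2, fields[1]);
                ps.addBatch();
                if (++count % BATCH_SIZE == 0) {
                    ps.executeBatch(); // send the accumulated batch to the server
                }
            }
            ps.executeBatch(); // flush the final partial batch
            conn.commit();
        }
    }
}
```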

Have a nice day

Alex

50k records is not a large dataset. A bigger batch size will help, but suppose your database server's network round trip is 10 ms:

  1. Batch size 50 => 50,000 rows / 50 batch size * 10 ms latency = 10000 ms latency overhead = 10 sec of latency overhead

  2. Batch size 100 => 50,000 rows / 100 batch size * 10 ms latency = 5000 ms latency overhead = 5 sec of latency overhead

Start by setting a reasonable batch size for the batch insert statements, and then measure how long it actually takes to insert the rows (see the sketch below). Remember to vacuum after the bulk insert.
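A minimal measurement sketch along those lines. The table, columns, and helper name are made up for illustration, and it assumes auto-commit has already been turned off on the connection:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchSizeBenchmark {

    // Times one import run for a given batch size so different sizes can be
    // compared empirically on your own schema, driver, and network.
    // Assumes conn.setAutoCommit(false) was called beforehand.
    static long timeImportMillis(Connection conn, List<String[]> rows, int batchSize)
            throws SQLException {
        long start = System.nanoTime();
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO people (name, email) VALUES (?, ?)")) { // assumed table
            int count = 0;
            for (String[] row : rows) {
                ps.setString(1, row[0]);
                ps.setString(2, row[1]);
                ps.addBatch();
                if (++count % batchSize == 0) {
                    ps.executeBatch(); // one network round trip per batch
                }
            }
            ps.executeBatch(); // flush the final partial batch
            conn.commit();
        }
        return (System.nanoTime() - start) / 1_000_000; // elapsed milliseconds
    }
}
```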

If 50k records take 1 minute to insert, you need to focus on optimizing the insertion process and not the JDBC batch size, since only a fraction of the total time is spent in latency overhead.

For larger data sets you should not be using JDBC at all. There are tools designed for bulk insertion, e.g. Oracle has SQL*Loader.

Karol Dowbecki
It also matters whether the entire dataset is inserted with the same SQL and whether the values bound to each parameter are always the same. If sometimes you bind a DATE to parameter 1 and other times a TIMESTAMP, this will slow things down a bunch. Every value bound to a given parameter should always be the same type. The driver will work if it's not, but it will be a lot slower. Similarly if a batch contains more than one SQL statement. Each batch should be created with just one SQL statement and one PreparedStatement. Again, it will work if you don't, but it will be a lot slower. – Douglas Surber Feb 26 '21 at 21:03
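To illustrate the comment above, a minimal sketch where each batch uses exactly one SQL statement and binds the same Java type to the parameter on every row (the events table and column are assumptions):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;
import java.util.List;

public class ConsistentBindExample {

    // One SQL statement, one PreparedStatement per batch, and the same Java
    // type bound to the parameter on every row.
    static void insertEvents(Connection conn, List<Timestamp> eventTimes)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO events (occurred_at) VALUES (?)")) { // assumed table
            for (Timestamp t : eventTimes) {
                // Always setTimestamp here; mixing setDate and setTimestamp on
                // the same parameter works but is much slower, per the comment above.
                ps.setTimestamp(1, t);
                ps.addBatch();
            }
            ps.executeBatch();
        }
    }
}
```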