
I am trying to improve the performance of my application, in which one of the operations is to read data from a CSV file and store the values from each row as one POJO (so 1500 CSV rows = 1500 POJOs) in a PostgreSQL database. It is a Spring Boot application and uses a JpaRepository with default configuration as the means of persistence. My original attempt was basically this statement in each iteration of the loop as it read each row in the CSV file:

autowiredRepoInstance.save(objectInstance);

However, with the `spring.jpa.show-sql=true` setting in the application.properties file, I saw that one insert was being done per POJO. My attempt at improving performance was to declare an ArrayList outside the loop, add each POJO instance to that list within the loop, and at every 500th item perform a save, as below (ignoring for now the cases where the row count is not a multiple of 500):

List<MyObject> objList = new ArrayList<>();        // MyObject stands in for the actual POJO type
for (MyObject objectInstance : parsedCsvRows) {    // parsedCsvRows: the rows read from the CSV
    objList.add(objectInstance);

    if (objList.size() == 500) {
        autowiredRepoInstance.save(objList);       // save(Iterable<T>) in Spring Data JPA 1.x
        objList.clear();
    }
}

However, this was also generating individual insert statements. What settings can I change to improve performance? Specifically, I would like to minimize the number of SQL statements/operations and have the underlying Hibernate use the "multirow" inserts that PostgreSQL allows:

https://www.postgresql.org/docs/9.6/static/sql-insert.html
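
For illustration, a single multirow insert combines several rows into one statement (table and column names below are hypothetical):

    INSERT INTO my_table (col_a, col_b)
    VALUES (1, 'first'),
           (2, 'second'),
           (3, 'third');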

But any other suggestions are also welcomed.

Thank you.

ITWorker
  • @a_horse_with_no_name I updated the reference. I am using 9.6. – ITWorker Jul 02 '18 at 15:03
  • The best performance you get if you write your own `INSERT .. SELECT`, but an acceptable solution would be to use [Hibernate batch inserts](https://docs.jboss.org/hibernate/orm/3.3/reference/en/html/batch.html#batch-inserts). This is a similar concept to the *multirow* insert. You must determine the optimal **batch size** by testing in your environment. – Marmite Bomber Jul 02 '18 at 15:24
  • You should use batch insert: https://stackoverflow.com/a/50882952 – Cepr0 Jul 02 '18 at 15:28
  • @Cepr0 I looked at your answer but for some reason my repo does not have a `saveAll` method even though it is also a `JpaRepository`. – ITWorker Jul 02 '18 at 15:50
  • @ITWorker The `saveAll()` method is available starting from Spring Boot 2.0. In previous versions you can use the `save()` method. – Cepr0 Jul 02 '18 at 15:52
  • @Cepr0 got it. I added `spring.jpa.properties.hibernate.jdbc.batch_size=500` and `?reWriteBatchedInserts=true` to my properties, and there was a great improvement (I did see that 8 batches were created by JDBC, as per `spring.jpa.properties.hibernate.generate_statistics=true`); however, I still see individual Hibernate insert statements. Why is this the case? Is it actually doing a multirow insert on the Postgres side even though the statements are printed individually? – ITWorker Jul 02 '18 at 16:03
  • @ITWorker without seeing your project it's difficult to say anything about those individual statements. If you run my [example](https://github.com/Cepr0/sb-jpa-batch-insert-demo) you will see only batch inserts in the app log... – Cepr0 Jul 02 '18 at 16:55
  • @Cepr0 if possible could you please run your project with `spring.jpa.show-sql=true` in your properties file? Maybe it's just a difference in the way JDBC and Hibernate do logging. – ITWorker Jul 02 '18 at 16:59
  • @ITWorker it's not necessary because in my project I use the JDBC logger `log4jdbc-spring-boot-starter` ;) – Cepr0 Jul 02 '18 at 17:01
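
For reference, a minimal sketch of the batching settings discussed in the comments above. The JDBC URL, database name, and batch size are illustrative; `hibernate.jdbc.batch_size` and `hibernate.order_inserts` are standard Hibernate properties, and `reWriteBatchedInserts` is a PostgreSQL JDBC driver option that rewrites batches into multirow inserts on the driver side:

    # application.properties -- illustrative values
    spring.jpa.properties.hibernate.jdbc.batch_size=500
    spring.jpa.properties.hibernate.order_inserts=true
    spring.jpa.properties.hibernate.generate_statistics=true
    # the driver rewrites JDBC batches into multirow INSERTs
    spring.datasource.url=jdbc:postgresql://localhost:5432/mydb?reWriteBatchedInserts=true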

1 Answer


First read all the data from the CSV and process it as below:

  1. Create a buffered reader over the input file.

  2. Create a stream over the buffered reader and apply filter or map operations to process the data.

  3. As output of the above you will get a list of entities.

  4. Divide the list of entities into a list of lists of entities (if you have huge data, such as more than a million records).

  5. Pass each inner list of entities (you can set it to 10000) to the JPA repository save method in batches; if possible use a parallel stream. A sketch follows this list.

  6. I processed 1.3 million records in less than a minute with the above process.
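
A minimal sketch of steps 1–5, assuming a hypothetical entity class `MyEntity`, a hypothetical per-row mapper `toEntity`, and an injected repository named `repository` (on Spring Boot 2.x use `saveAll`; on 1.x the equivalent is `save(Iterable)`):

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.List;
    import java.util.stream.Collectors;

    public void importCsv(String path) throws IOException {
        // Steps 1-2: buffered reader over the file, stream with a map step per row
        List<MyEntity> entities;
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
            entities = reader.lines()
                    .skip(1)                        // skip the header row
                    .map(this::toEntity)            // hypothetical row-to-entity mapper
                    .collect(Collectors.toList());  // step 3: list of entities
        }

        // Steps 4-5: divide into sublists and save each chunk in one repository call
        int chunkSize = 10_000;
        for (int i = 0; i < entities.size(); i += chunkSize) {
            List<MyEntity> chunk = entities.subList(i, Math.min(i + chunkSize, entities.size()));
            repository.saveAll(chunk); // use save(chunk) on Spring Data JPA 1.x
        }
    }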

Or use a dedicated batch-processing technology.

David Buck