
I have a CSV file with more than 1 million records. I want to do some processing on these records and persist all of them in the DB.

I tried a few options:

  1. Save all entities in one go

    jpaRepository.save(entities);

This method takes forever and never completes. It works fine for a smaller number of records.

  2. Save all entities one by one

    entities.forEach(jpaRepository::save);

This method completes but takes an extremely long time, and memory usage goes through the roof.

Pramod
  • What is the underlying database? Are you sure the database is not the bottleneck? – Subir Kumar Sao Sep 15 '17 at 10:51
  • Use an ETL Tool like Pentaho Data Integration. Seriously, you will save yourself a lot of headache. – fhossfel Sep 15 '17 at 10:53
  • Don't use JPA for that. Use `LOAD DATA INFILE` instead. https://stackoverflow.com/questions/14127529/mysql-import-data-from-csv-using-load-data-infile. That will bring the execution time down to a few seconds (if even a second) – baao Sep 15 '17 at 10:55
  • You should definitely use a bulk data loader utility for this. – Mick Mnemonic Sep 15 '17 at 11:00
  • and if you really want to use JPA you should do batch inserts and flush every N entities saved (see the sketch after these comments) – Zeromus Sep 15 '17 at 11:00
  • @Subir It's a Postgres DB, a local dedicated instance. But I shall still verify whether there is any issue on the DB side – Pramod Sep 15 '17 at 11:04
  • @mick Any utility you recommend? – Pramod Sep 15 '17 at 11:05
  • @baao I have to perform some business logic before saving the data to the DB. – Pramod Sep 15 '17 at 11:07
  • [How to import CSV into PostgreSQL](https://stackoverflow.com/questions/2987433/how-to-import-csv-file-data-into-a-postgresql-table) – Mick Mnemonic Sep 15 '17 at 11:16
  • Just want to add that my whole process is transactional. It looks like, because the transaction is enabled, entities are not getting garbage collected, and my application goes OOM after some time. If I remove @Transactional then everything works fine – Pramod Sep 17 '17 at 04:48
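
A minimal sketch of that flush-and-clear approach, assuming Hibernate behind Spring Data JPA; the `CsvBatchWriter` class, `MyEntity`, and the batch size are illustrative, not part of the original code:

    import java.util.List;
    import javax.persistence.EntityManager;
    import javax.persistence.PersistenceContext;
    import org.springframework.stereotype.Service;
    import org.springframework.transaction.annotation.Transactional;

    @Service
    public class CsvBatchWriter {

        // Works best when it matches hibernate.jdbc.batch_size
        private static final int BATCH_SIZE = 1000;

        @PersistenceContext
        private EntityManager entityManager;

        // Single transaction, but the persistence context is flushed and cleared
        // every BATCH_SIZE inserts, so already-written entities can be garbage
        // collected instead of accumulating until the commit.
        @Transactional
        public void saveAll(List<MyEntity> entities) {
            for (int i = 0; i < entities.size(); i++) {
                entityManager.persist(entities.get(i));
                if ((i + 1) % BATCH_SIZE == 0) {
                    entityManager.flush();
                    entityManager.clear();
                }
            }
            entityManager.flush();
            entityManager.clear();
        }
    }

For the flushes to turn into batched JDBC inserts, Hibernate's `hibernate.jdbc.batch_size` (and ideally `hibernate.order_inserts=true`) should also be set. The whole import still runs in one transaction, so it can be rolled back as a unit, at the cost of a large transaction on the database side.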

1 Answer


Here is what I would recommend, based just on your question -

  1. Create a service that reads the file, say FileReaderService
  2. Create a service that writes a set number of records, say 1000 at a time; let us call it StorageService. Inject this into FileReaderService (a rough sketch follows below).
  3. Put the @Transactional annotation on the save_N_records method.
  4. Repeatedly call StorageService.save_N_records from FileReaderService. Each time you call it, make sure you write a log to monitor progress.
  5. If it is at all possible, I would disable indexing on the table so inserts are faster, then turn it back on when I am done inserting. Of course, this is never possible on an on-line system, only on off-line reporting systems.

Hope this helps!
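
A rough sketch of that structure, assuming Spring Data JPA; `MyEntity`, `MyEntityRepository`, the CSV-to-entity mapping, and the chunk size are placeholders for the real model and business logic (the classes are shown together for brevity):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Stream;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.data.jpa.repository.JpaRepository;
    import org.springframework.stereotype.Service;
    import org.springframework.transaction.annotation.Transactional;

    // MyEntityRepository.java -- placeholder Spring Data repository
    public interface MyEntityRepository extends JpaRepository<MyEntity, Long> {
    }

    // StorageService.java
    @Service
    public class StorageService {

        private final MyEntityRepository repository;

        public StorageService(MyEntityRepository repository) {
            this.repository = repository;
        }

        // Step 3: one transaction per chunk, so a committed chunk can be
        // garbage collected before the next chunk is read.
        @Transactional
        public void save_N_records(List<MyEntity> chunk) {
            repository.save(chunk);   // saveAll(chunk) on Spring Data JPA 2.x
        }
    }

    // FileReaderService.java
    @Service
    public class FileReaderService {

        private static final Logger log = LoggerFactory.getLogger(FileReaderService.class);
        private static final int CHUNK_SIZE = 1000;

        private final StorageService storageService;   // step 2: injected

        public FileReaderService(StorageService storageService) {
            this.storageService = storageService;
        }

        // Step 1: read the file; step 4: save chunk by chunk and log progress.
        public void importCsv(String path) throws IOException {
            List<MyEntity> chunk = new ArrayList<>(CHUNK_SIZE);
            long total = 0;
            try (Stream<String> lines = Files.lines(Paths.get(path))) {
                for (String line : (Iterable<String>) lines::iterator) {
                    chunk.add(toEntity(line));          // business logic / mapping goes here
                    if (chunk.size() == CHUNK_SIZE) {
                        storageService.save_N_records(chunk);
                        total += chunk.size();
                        log.info("Persisted {} records so far", total);
                        chunk.clear();
                    }
                }
            }
            if (!chunk.isEmpty()) {
                storageService.save_N_records(chunk);   // final partial chunk
            }
        }

        // Hypothetical CSV-to-entity mapping.
        private MyEntity toEntity(String line) {
            return new MyEntity(line);
        }
    }

Because each chunk commits in its own transaction, a failure only rolls back the chunk in flight; that is the trade-off discussed in the comments below.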
Prashant
  • I want this process to be in one transaction. If I put @Transactional on save_N_records, then this method will keep flushing the records to the DB, and if there is any failure, how do I roll back the complete transaction? – Pramod Sep 18 '17 at 02:44
  • @Pramod, if you want the whole thing to be a single transaction, it will be hard to do so without a large redo log, and hence hard to speed up. I would recommend you try to break it up into multiple transactions. Good luck with your project. – Prashant Sep 18 '17 at 21:25
  • Great piece of advice. Thanks. – HopeKing Jan 26 '18 at 05:00