
I'm working on a process that checks and updates data from an Oracle database. I'm using Hibernate and the Spring Framework in my application.

The application reads a CSV file, processes the content, then persists the entities:

public class Main {
    @Autowired
    Process process;
    @Autowired
    WriteResult writeResult;

    public void run(String path) {
        Input input = ReadCSV.read(path);
        EntityList resultList = process.process(input);
        writeResult.write(resultList);
        ...
    }
}

// Process class that loops over the input
public class Process {
    @Autowired
    ProcessLine processLine;

    public EntityList process(Input input) {
        EntityList results = ...;
        ...
        for (Line line : input.readLine()) {
            results.add(processLine.process(line));
            ...
        }
        return results;
    }
}
// Retrieving and updating entities
public class ProcessLine {
    @Autowired
    DomaineRepository domaineRepository;
    @Autowired
    CompanyDomaineService companydomaineService;

    @Transactional
    public MyEntity process(Line line) {
        // getCompanyByXX are CrudRepository methods with @Query that return an entity object
        MyEntity companyToAttach = domaineRepository.getCompanyByCode(line.getCode());
        MyEntity companyToDetach = domaineRepository.getCompanyBySiret(line.getSiret());
        if (companyToDetach == null || companyToAttach == null) {
            throw new CustomException("Custom Exception");
        }
        // attachCompany retrieves some relationEntity, removes companyToDetach and
        // adds companyToAttach; this updates the relationEntity.company attribute.
        companydomaineService.attachCompany(companyToAttach, companyToDetach);
        return companyToAttach;
    }
}

public class WriteResult {
    @Autowired
    DomaineRepository domaineRepository;

    @Transactional
    public void write(EntityList results) {
        for (MyEntity result : results) {
            domaineRepository.save(result);
        }
    }
}

The application works well on files with few lines, but when I try to process large files (200,000 lines) the performance slows drastically and I get a SQL timeout. I suspect cache issues, but I'm also wondering whether saving all the entities at the end of the processing is bad practice?

– Nero Ouali
  • That is how Hibernate works. Each time you save an entity it is added to the first-level cache; this cache is bound to the transaction. Now each time you call save it does a dirty check on all the entities it has in the first-level cache. The first 100 probably aren't noticeable, but after that it adds up. What you need to do is, after each X entities saved, do a `flush` and `clear` of the entity manager/session (depending on whether you are using JPA or plain Hibernate); see the sketch after these comments. See also https://stackoverflow.com/questions/39469760/how-to-use-hibernate-batch-processing – M. Deinum Feb 21 '18 at 14:51
  • @M.Deinum: thanks for your answer, but the decreasing performance happens in the process part, not the save part. With a big file as input the application never reaches the save method; it crashes with a SQL timeout way before that (while modifying JPA entities, I'm not saving anything at this level). – Nero Ouali Feb 21 '18 at 15:46
  • It doesn't matter. When querying, the same applies: before doing a query, Hibernate also dirty-checks the entities in the first-level cache, and if there are dirty entities those changes are persisted to the database before issuing the select. I strongly suggest a read on how JPA (or Hibernate) works. – M. Deinum Feb 21 '18 at 19:30
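
A minimal sketch of the flush/clear batching described in the comments above, assuming a JPA `EntityManager`; the `BATCH_SIZE` value, the `EntityList` construction, and the wiring of `ProcessLine` are illustrative assumptions, not code from the question:

import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.transaction.annotation.Transactional;

public class BatchedProcess {

    private static final int BATCH_SIZE = 50; // illustrative value

    @PersistenceContext
    private EntityManager entityManager;

    @Autowired
    private ProcessLine processLine;

    @Transactional
    public EntityList process(Input input) {
        EntityList results = new EntityList(); // assumes a no-arg constructor
        int count = 0;
        for (Line line : input.readLine()) {
            results.add(processLine.process(line));
            // Flush pending SQL and clear the persistence context every
            // BATCH_SIZE entities so the dirty check does not grow unbounded.
            if (++count % BATCH_SIZE == 0) {
                entityManager.flush();
                entityManager.clear(); // entities processed so far are now detached
            }
        }
        return results;
    }
}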

2 Answers


The problem is your for loop, which saves each result individually and thus issues single inserts, slowing things down. Hibernate and Spring support batch inserts, and they should be used whenever possible.

Something like `domaineRepository.saveAll(results)`.
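
Note that `saveAll` by itself still executes one insert statement per entity unless JDBC batching is enabled; a minimal sketch, assuming Spring Data JPA, that `EntityList` implements `Iterable<MyEntity>`, and an illustrative batch size of 50:

// In application.properties (illustrative values):
//   spring.jpa.properties.hibernate.jdbc.batch_size=50
//   spring.jpa.properties.hibernate.order_inserts=true
//   spring.jpa.properties.hibernate.order_updates=true

@Transactional
public void write(EntityList results) {
    // One repository call instead of one save per entity; with
    // hibernate.jdbc.batch_size set, Hibernate groups the statements
    // into JDBC batches.
    domaineRepository.saveAll(results);
}

One caveat: if `MyEntity` uses an IDENTITY id generator, Hibernate cannot batch the inserts, so a sequence-based generator works better here.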

Since you are processing a lot of data, it might be better to do things in batches: instead of fetching one company to attach at a time, fetch a list of companies to attach and process those, then fetch a list of companies to detach and process those.

public EntityList process(Input input) {
   EntityList results;
   List<Code> companiesToAdd = new ArrayList<>();
   List<Siret> companiesToRemove = new ArrayList<>();
   for (Line line : input.readLine()) {
      companiesToAdd.add(line.getCode());
      companiesToRemove.add(line.getSiret());
      ...
   }
   results = process(companiesToAdd, companiesToRemove);
   return results;
}

public List<MyEntity> process(List<Code> companiesToAdd, List<Siret> companiesToRemove) {
  List<MyEntity> attachList = domaineRepository.getCompanyByCodeIn(companiesToAdd);
  List<MyEntity> detachList = domaineRepository.getCompanyBySiretIn(companiesToRemove);
  if (attachList.isEmpty() || detachList.isEmpty()) {
    throw new CustomException("Custom Exception");
  }
  companydomaineService.attachCompany(attachList, detachList);
  return attachList;
}

The above code is just pseudocode to point you in the right direction; you will need to work out what works for you.

– locus2k
  • Thanks for the answer, I'll recheck the saving part. But the slowdown is happening in the process method. Actually I'm updating JPA entities and adding them to a list; that list is passed to the WriteResult method, which saves all the entities. That last method is never reached when processing a large file. – Nero Ouali Feb 21 '18 at 15:51
  • Can you update your process function with the methods that you use, if that's where it is hanging up? It was hard to tell that's where the hangup was without any code. – locus2k Feb 21 '18 at 15:54
  • I updated the process method a little bit; basically I'm just requesting objects and updating my entities without saving them. – Nero Ouali Feb 21 '18 at 16:24
  • I updated my answer. I believe your problem is that you are doing a lot of single transactions, which is hitting the database hard. Try to do things in bulk transactions, which typically means returning lists of items instead of single items, then acting on the list as needed. The less you hit the database, the faster it should go. – locus2k Feb 21 '18 at 17:34

For every line you read, you are doing two read operations here:

MyEntity companyToAttach = domaineRepository.getCompanyByCode(line.getCode());
MyEntity companyToDetach = domaineRepository.getCompanyBySiret(line.getSiret());

You can read more than one line, use an IN query, and then process the resulting list of companies.
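
A sketch of what such repository methods could look like using Spring Data's derived `In` queries; the method names, the `String` key types, and the `Long` id type are assumptions:

import java.util.Collection;
import java.util.List;
import org.springframework.data.repository.CrudRepository;

public interface DomaineRepository extends CrudRepository<MyEntity, Long> {

    // Generates "... where code in (?, ?, ...)": one query per batch
    // of lines instead of one query per line.
    List<MyEntity> findByCodeIn(Collection<String> codes);

    List<MyEntity> findBySiretIn(Collection<String> sirets);
}

Remember to chunk the key lists (for example 1,000 per query) to stay under Oracle's 1,000-item limit for IN lists.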

– Tarun Jain