I have to process a huge amount of data distributed over 20 tables (~5 million records in summary) and I need to efficently load them.
I'm using Wildfly 14 and JPA/Hibernate.
Since in the end, every single record will be used by the business logic (in the same transaction), I decided to pre-load the entire content of the required tables into memory via simply:
em.createQuery("SELECT e FROM Entity e").size();
After that, every object should be availabe in the transaction and thus be available via:
em.find(Entity.class, id);
But this doesn't work somehow and there are still a lot of calls to the DB, especially for the relationships.
How can I efficiently load the whole content of the required tables including the relationships and make sure I got everything / there will be no further DB calls?
What I already tried:
- FetchMode.EAGER: Still too many single selects / object graph too complex
- EntityGraphs: Same as FetchMode.EAGER
- Join fetch statements: Best results so far, since it simultaneously populates the relationships to the referred entities
- 2nd Level / Query Cache: Not working, probably the same problem as
em.find
One thing to note is that the data is immutable (at least for a specific time) and could also be used in other transactions.
Edit:
My plan is to load and manage the entire data in a @Singleton
bean. But I want to make sure I'm loading it the most efficient way and be sure the entire data is loaded. There should be no further queries necessary when the business logic is using the data. After a specific time (ejb timer), I'm going to discard the entire data and reload the current state from the DB (always whole tables).