9

I have to process a huge amount of data distributed over 20 tables (~5 million records in total), and I need to load them efficiently.

I'm using Wildfly 14 and JPA/Hibernate.

Since, in the end, every single record will be used by the business logic (in the same transaction), I decided to pre-load the entire content of the required tables into memory, simply via:

em.createQuery("SELECT e FROM Entity e").getResultList().size();

After that, every object should be available in the transaction and thus be accessible via:

em.find(Entity.class, id);

But somehow this doesn't work: there are still a lot of calls to the DB, especially for the relationships.

How can I efficiently load the whole content of the required tables, including the relationships, and make sure everything is loaded / there will be no further DB calls?

What I already tried:

  • FetchType.EAGER: Still too many single selects / object graph too complex
  • EntityGraphs: Same as FetchType.EAGER (see the sketch after this list)
  • Join fetch statements: Best results so far, since it simultaneously populates the relationships to the referred entities
  • 2nd Level / Query Cache: Not working, probably the same problem as em.find
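
For illustration, the EntityGraph attempt looked roughly like this (the "children" attribute is just a placeholder for the real relationships):

// Ask for the relationship to be loaded together with the root entities
EntityGraph<Entity> graph = em.createEntityGraph(Entity.class);
graph.addAttributeNodes("children");

List<Entity> all = em.createQuery("SELECT e FROM Entity e", Entity.class)
        .setHint("javax.persistence.loadgraph", graph)
        .getResultList();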

One thing to note is that the data is immutable (at least for a specific time) and could also be used in other transactions.

Edit:

My plan is to load and manage the entire data set in a @Singleton bean. But I want to make sure I'm loading it in the most efficient way and that all of the data is loaded. There should be no further queries necessary when the business logic is using the data. After a specific time (EJB timer), I'm going to discard the entire data set and reload the current state from the DB (always whole tables).
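
Roughly sketched, the holder bean would look something like this (the names, the getId() accessor and the hourly timer are placeholders, not final code):

import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

import javax.annotation.PostConstruct;
import javax.ejb.Schedule;
import javax.ejb.Singleton;
import javax.ejb.Startup;
import javax.persistence.EntityManager;
import javax.persistence.PersistenceContext;

@Singleton
@Startup
public class EntityHolder {

    @PersistenceContext
    private EntityManager em;

    // Immutable snapshot of the table, swapped wholesale on every reload.
    private volatile Map<Long, Entity> entitiesById;

    @PostConstruct
    void init() {
        reload();
    }

    // EJB timer: discard the data and reload the current DB state (here: hourly).
    @Schedule(hour = "*", persistent = false)
    public void reload() {
        entitiesById = em.createQuery("SELECT e FROM Entity e", Entity.class)
                .getResultList().stream()
                .collect(Collectors.toMap(Entity::getId, Function.identity()));
    }

    public Entity find(long id) {
        return entitiesById.get(id);
    }
}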

Meini
  • That is a terrible plan for lessening the number of queries. What are you trying to solve? No matter how I look at it, you are caching your entire db, which is not how things can work. Also, if you have more than one server... Why not make a mirror db? You can copy everything there and send your queries to the mirror. This will effectively do the same without replication problems – Andrii Plotnikov Nov 18 '18 at 01:53
  • I want to load static data from a few consistent tables (not the "entire db") into memory, which is **entirely** needed in several time-critical processes within a short time frame, on a server where memory is practically not limited. I can't see where this is "not how things can work" or "terrible" at all. Why would I need a mirror db and its overhead if no transactions/concurrency are needed? Also, to be effective, the mirror db has to be in memory, so where is the difference/benefit compared to my approach? – Meini Nov 19 '18 at 06:27

3 Answers

6

Keep in mind that you'll likely need a 64-bit JVM and a large amount of memory. Take a look at the Hibernate 2nd level cache. Some things to check for, since we don't have your code:

  1. The @Cacheable annotation clues Hibernate in that the entity is cacheable
  2. Configure 2nd level caching to use something like ehcache, and set the maximum memory elements to something big enough to fit your working set into it (see the sketch after this list)
  3. Make sure you're not accidentally using multiple sessions in your code.
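
A minimal sketch of points 1 and 2, assuming Hibernate 5 with the ehcache region factory on the classpath (the entity is just an example; READ_ONLY fits the immutable data described in the question):

import javax.persistence.Cacheable;
import javax.persistence.Entity;
import javax.persistence.Id;

import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;

// Point 1: mark the entity as cacheable in the 2nd level cache.
@Entity
@Cacheable
@Cache(usage = CacheConcurrencyStrategy.READ_ONLY)
public class SomeEntity {
    @Id
    private long id;
}

// Point 2, in persistence.xml (per-region sizes are then tuned in ehcache.xml):
//   <property name="hibernate.cache.use_second_level_cache" value="true"/>
//   <property name="hibernate.cache.region.factory_class"
//             value="org.hibernate.cache.ehcache.EhCacheRegionFactory"/>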

If you need to process things in this way, you may want to consider changing your design to not rely on having everything in memory, not using Hibernate/JPA, or not using an app server. This would give you more control over how things are executed. This may even be a better fit for something like Hadoop. Without more information it's hard to say which direction would be best for you.

Jason Armstrong
  • I configured the 2nd lvl cache / Cacheable, but afaik only em.find() will make use of the 2nd lvl cache. When I access OneToMany collections there are still a lot of queries to the DB, even when they are defined as FetchType.EAGER. I also have to / should stick with an application server because it gives me a lot of infrastructure I need. The server runs on a machine with 144 GB memory, it should be enough. – Meini Nov 13 '18 at 05:49
  • @Meini What are your JVM memory settings set to, and do you have a 64-bit version installed? Even if your server has 144 GB, your JVM has to be configured to handle it. – Jason Armstrong Nov 14 '18 at 11:04
  • I have Java 8 64-bit installed; memory is limited at -Xmx120g to leave space for other apps. My test data is just a few GB (~5). – Meini Nov 15 '18 at 06:36
  • Well, the 2nd level cache won't help me anyway, because I still have to load the data, and that's what the question is about. The 2nd level cache could be a way to keep data in memory, but the data is only indexed by primary key, which means queries won't make use of it; just things like em.find() will work. – Meini Nov 16 '18 at 06:02
5

I understand what you're asking, but JPA/Hibernate isn't going to want to cache that much data for you, or at least I wouldn't expect a guarantee from it. Consider that you described 5 million records. What is the average length per record? 100 bytes gives 500 megabytes of memory, which will just crash your untweaked JVM. Probably more like 5000 bytes average, and that's 25 GB of memory. You need to think about what you're asking for.

If you want it cached, you should do that yourself, or better yet just use the results when you have them. If you want memory-based data access, you should look at a technology specifically for that. http://www.ehcache.org/ seems popular, but it's up to you, and you should be sure you understand your use case first.

If you are trying to be database-efficient, then you should just understand what you're doing and design and test carefully.

K.Nicholas
  • *"100 bytes gives 500 megabytes of permgen memory"* - Why permgen? Note that it doesn't even exist for Java 8 onwards. But even in older JVMs, permgen is only used for certain things. And ordinary string data is NOT one of those things. – Stephen C Nov 17 '18 at 02:22
  • 1
    Fair enough. Perhaps heap would be better for the example at hand. I'll just take the specifics out. When loading with joins in JPA it can use a lot of *MetaSpace*, which can cause problems. You need to be aware of this when running in a limited environment like a container. Even though the final memory requirements may *only* be 25 GB, loading that much of something more complicated than strings will require a bit more than 25 GB of system memory. – K.Nicholas Nov 17 '18 at 03:13
  • But why Metaspace? Does JPA somehow use Metaspace to hold data? Do you have a reference for that? – Stephen C Nov 17 '18 at 05:12
  • I don't have a reference now, but I haven't looked very hard. I had to set MAX_METASPACE = 192m, up from a default of I think 128m or 64m, for a docker/wildfly based project on openshift, in order for a read of some 10-30K rows into a set of entities to be held in memory. The system requires 720m to start, but when running with entities loaded into memory it uses only about 380 heap/non-heap total. Max is about 300/600 heap/non-heap max. I traced it carefully to the JPA fetch code, and the setting was specifically Metaspace to resolve the issue. – K.Nicholas Nov 17 '18 at 05:38
  • My point with the above question was that you can't just say you want to load 5 million rows into memory because you think it's a good idea, especially using a sophisticated object-oriented persistence framework like JPA, because you need joins as well. – K.Nicholas Nov 17 '18 at 05:40
  • In your example, the metaspace was most likely due to (excessive) generation of dynamic proxy classes ... not the amount of data involved. Metaspace is used to hold code and class-related descriptors. Metaspace doesn't hold data. So I think you have drawn an incorrect conclusion from a fix that applied to a specific problem. – Stephen C Nov 17 '18 at 05:47
  • In addition, according to https://blogs.oracle.com/poonam/about-g1-garbage-collector,-permanent-generation-and-metaspace, metaspace is by default only limited by the amount of available memory / address space. The manual entry for the `java` command agrees. The "default" you were seeing must have been something set by the application-specific wrapper / launcher. – Stephen C Nov 17 '18 at 05:54
  • Yup, like I said, granted, and I changed my response. I am assuming heap holds the data, but the metaspace is used when Hibernate needs to do its thing to get everything loaded. I work more on app design and am basically familiar with JVM details as needed. Yes, with my project there will be perhaps 400 entities with somewhere between 10-1000 relationships of various types associated with each. – K.Nicholas Nov 17 '18 at 05:55
  • Re: metaspace being limited only by system memory, that is constrained when running in a container. There have been issues, recently resolved in newer versions of the JVM, where the JVM thinks it has all the memory the system has, but the container kills it when it uses more than the container allows. Newer versions of the JVM are more container aware, and this can be important when trying to tune memory settings in cloud-based containers. You can find plenty online about that. – K.Nicholas Nov 17 '18 at 05:59
5

Basically, it should be a pretty easy task to load entire tables with one query per table and link the objects, but JPA works differently, as shown in this example.

The biggest problem is @OneToMany/@ManyToMany relations:

@Entity
public class Employee {
    @Id
    @Column(name="EMP_ID")
    private long id;
    ...
    @OneToMany(mappedBy="owner")
    private List<Phone> phones;
    ...
}

@Entity
public class Phone {
    @Id
    private long id;    
    ...
    @ManyToOne
    @JoinColumn(name="OWNER_ID")
    private Employee owner;
    ...
}

FetchType.EAGER

If the relation is defined as FetchType.EAGER and the query is SELECT e FROM Employee e, Hibernate generates the SQL statement SELECT * FROM EMPLOYEE and, right after it, SELECT * FROM PHONE WHERE OWNER_ID=? for every single Employee loaded, commonly known as the n+1 problem.

I could avoid the n+1 problem by using the JPQL query SELECT e FROM Employee e LEFT JOIN FETCH e.phones, which will result in something like SELECT * FROM EMPLOYEE LEFT OUTER JOIN PHONE ON EMP_ID = OWNER_ID.
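
For completeness, issuing that query through the EntityManager looks like this; the DISTINCT compensates for the row multiplication caused by the fetch join:

List<Employee> employees = em.createQuery(
        "SELECT DISTINCT e FROM Employee e LEFT JOIN FETCH e.phones", Employee.class)
        .getResultList();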

The problem is that this won't work for a complex data model with ~20 tables involved.

FetchType.LAZY

If defined as FetchType.LAZY, the query SELECT e FROM Employee e will just load all Employees, with their phones collections as uninitialized proxies; the related Phones are loaded only when phones is accessed, which in the end leads to the n+1 problem as well.

To avoid this, the obvious idea is to load all the Phones into the same session with SELECT p FROM Phone p. But when accessing phones, Hibernate will still execute SELECT * FROM PHONE WHERE OWNER_ID=?, because Hibernate doesn't know that all the Phones are already in its current session.

Even when using the 2nd level cache, the statement will still be executed against the DB, because Phone is indexed by its primary key in the 2nd level cache, not by OWNER_ID.

Conclusion

There is no mechanism like "just load all data" in Hibernate.

It seems there is no other way than to keep the relationships transient and connect them manually, or even just to use plain old JDBC.
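
Manual linking would look roughly like this (assuming the usual getters plus a hypothetical @Transient list with a setTransientPhones(...) helper on Employee):

// One query per table; the mapped (lazy) collections stay untouched.
List<Employee> employees =
        em.createQuery("SELECT e FROM Employee e", Employee.class).getResultList();
List<Phone> phones =
        em.createQuery("SELECT p FROM Phone p", Phone.class).getResultList();

// Group the phones by owner id and attach them in memory; the owners are
// already in the session, so no further DB calls happen here.
Map<Long, List<Phone>> phonesByOwner = phones.stream()
        .collect(Collectors.groupingBy(p -> p.getOwner().getId()));
for (Employee e : employees) {
    e.setTransientPhones(
            phonesByOwner.getOrDefault(e.getId(), Collections.emptyList()));
}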

EDIT:

I just found a solution which works very well. I defined all relevant @ManyToMany and @OneToMany relations as FetchType.EAGER combined with @Fetch(FetchMode.SUBSELECT), and all @ManyToOne relations with @Fetch(FetchMode.JOIN), which results in an acceptable loading time. In addition to adding @javax.persistence.Cacheable(true) to all entities, I added @org.hibernate.annotations.Cache to every relevant collection, which enables collection caching in the 2nd level cache. I disabled 2nd level cache timeout eviction and "warm up" the 2nd level cache via a @Singleton EJB combined with @Startup on server start / deploy. Now I have 100% control over the cache: there are no further DB calls until I manually clear it.
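
Applied to the Employee/Phone example above, the combination looks roughly like this (READ_ONLY is chosen because the data is immutable):

import java.util.List;

import javax.persistence.Cacheable;
import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.FetchType;
import javax.persistence.Id;
import javax.persistence.JoinColumn;
import javax.persistence.ManyToOne;
import javax.persistence.OneToMany;

import org.hibernate.annotations.Cache;
import org.hibernate.annotations.CacheConcurrencyStrategy;
import org.hibernate.annotations.Fetch;
import org.hibernate.annotations.FetchMode;

@Entity
@Cacheable(true)
public class Employee {
    @Id
    @Column(name = "EMP_ID")
    private long id;

    // EAGER + SUBSELECT: one additional query per collection instead of one per owner.
    @OneToMany(mappedBy = "owner", fetch = FetchType.EAGER)
    @Fetch(FetchMode.SUBSELECT)
    // Enables collection caching in the 2nd level cache.
    @Cache(usage = CacheConcurrencyStrategy.READ_ONLY)
    private List<Phone> phones;
}

// In its own file:
@Entity
@Cacheable(true)
public class Phone {
    @Id
    private long id;

    // To-one relations are fetched via join.
    @ManyToOne
    @Fetch(FetchMode.JOIN)
    @JoinColumn(name = "OWNER_ID")
    private Employee owner;
}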

Meini
  • 2
    Still, in Hibernate there is `FetchMode.SUBSELECT`. It's quite an important one because it can reduce the data transferred from the db to the application server. Check this question about it: https://stackoverflow.com/questions/32984799/fetchmode-join-vs-subselect – michaeak Nov 16 '18 at 17:47
  • Thx for your comment. Yeah, I'm aware of it and already tried to optimize queries with well placed `FetchMode.JOIN` and `FetchMode.SUBSELECT`, and I could clearly reduce the query count, but it didn't give me the results I was hoping for; maybe I should give it another try. – Meini Nov 19 '18 at 06:39