22

I need to make my web app work with really huge datasets. At the moment I get either an OutOfMemoryError or output that takes 1-2 minutes to generate.

Let's keep it simple and suppose that we have 2 tables in the DB: Worker and WorkLog, with about 1,000 rows in the first one and 10,000,000 rows in the second one. The latter table has several fields, among them 'workerId' and 'hoursWorked'. What we need is:

  1. count the total hours worked by each user;

  2. list the work periods for each user.

The most straightforward approach (IMO) for each task in plain SQL is:

1)

select Worker.name, sum(hoursWorked) from Worker, WorkLog 
   where Worker.id = WorkLog.workerId 
   group by Worker.name;

//results of this query should be transformed to Multimap<Worker, Long>

2)

select Worker.name, WorkLog.start, WorkLog.hoursWorked from Worker, WorkLog
   where Worker.id = WorkLog.workerId;

//results of this query should be transformed to Multimap<Worker, Period>
//if it were JDBC, it would be vital
//to call resultSet.setFetchSize(someSmallNumber), ~100

So, I have two questions:

  1. how to implement each of my approaches with JPA (or at least with Hibernate);
  2. how would you handle this problem (with JPA or Hibernate of course)?
Roman
  • Are you trying to create a report, or are you trying to load a bunch of objects? If you are just trying to create a report, then do it in SQL like you said and be done with it. – Zak May 03 '10 at 22:10
  • @Zak: I have a web application in jpa+spring+jsf which works, but its performance should be better. And, what is more important, it should be able to handle much larger datasets than it can at the moment. 1) The problem with the 1st query is that I don't know how to write it in `hql` or `jpa query language`. I don't want to use plain sql; imho it's a last resort. 2) The problem with the 2nd query is that I don't know how to set the fetch size in `JPA`, and I also don't know how to handle this situation with `JPA`: there is no loop through the resultset, and I don't know how to load the 'next' fetch. – Roman May 03 '10 at 22:18

6 Answers

23

suppose that we have 2 tables in DB: Worker and WorkLog with about 1000 rows in the first one and 10 000 000 rows in the second one

For high volumes like this, my recommendation would be to use the StatelessSession interface from Hibernate:

Alternatively, Hibernate provides a command-oriented API that can be used for streaming data to and from the database in the form of detached objects. A StatelessSession has no persistence context associated with it and does not provide many of the higher-level life cycle semantics. In particular, a stateless session does not implement a first-level cache nor interact with any second-level or query cache. It does not implement transactional write-behind or automatic dirty checking. Operations performed using a stateless session never cascade to associated instances. Collections are ignored by a stateless session. Operations performed via a stateless session bypass Hibernate's event model and interceptors. Due to the lack of a first-level cache, stateless sessions are vulnerable to data aliasing effects. A stateless session is a lower-level abstraction that is much closer to the underlying JDBC.

StatelessSession session = sessionFactory.openStatelessSession();
Transaction tx = session.beginTransaction();

ScrollableResults customers = session.getNamedQuery("GetCustomers")
    .scroll(ScrollMode.FORWARD_ONLY);
while ( customers.next() ) {
    Customer customer = (Customer) customers.get(0);
    customer.updateStuff(...);
    session.update(customer);
}

tx.commit();
session.close();

In this code example, the Customer instances returned by the query are immediately detached. They are never associated with any persistence context.

The insert(), update() and delete() operations defined by the StatelessSession interface are considered to be direct database row-level operations. They result in the immediate execution of a SQL INSERT, UPDATE or DELETE respectively. They have different semantics to the save(), saveOrUpdate() and delete() operations defined by the Session interface.
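Applied to the question's Worker/WorkLog case, here is a minimal sketch (mine, not from the Hibernate documentation; the WorkLog property names and the Long types are assumptions) that streams the log rows and aggregates the totals without ever holding the full result set in memory:

StatelessSession session = sessionFactory.openStatelessSession();
Transaction tx = session.beginTransaction();

// stream the rows through a small JDBC fetch window instead of loading them all
Map<Long, Long> totals = new HashMap<Long, Long>();
ScrollableResults logs = session
    .createQuery("select wl.workerId, wl.hoursWorked from WorkLog wl")
    .setFetchSize(100)
    .scroll(ScrollMode.FORWARD_ONLY);
while (logs.next()) {
    Long workerId = (Long) logs.get(0);
    Long hours = (Long) logs.get(1);
    Long sum = totals.get(workerId);
    totals.put(workerId, sum == null ? hours : sum + hours);
}
logs.close();

tx.commit();
session.close();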

Derek Mahar
Pascal Thivent
  • @Pascal Thivent: thanks for the answer! About volumes: I don't know the real volumes, I just specified the maximum (in my opinion which is based on some knowledge of the domain). Maybe the real volume is 10-100 times less and IMHO the solution for these volumes will be ok too. – Roman May 03 '10 at 22:33
  • Do you know what they exactly mean by "stateless sessions are vulnerable to data aliasing effects"? Thanks. – Guido Nov 21 '12 at 23:52
  • This is in no way faster. In fact, it is **extremely** slow and way less performant than the usual use of `EntityManager`. – phil294 Feb 16 '17 at 21:31
4

It seems you can do this with EclipseLink too. Check this example: http://wiki.eclipse.org/EclipseLink/Examples/JPA/Pagination :

Query query = em.createQuery(...); // your JPQL query
query.setHint(QueryHints.CURSOR, true)
     .setHint(QueryHints.SCROLLABLE_CURSOR, true);
ScrollableCursor scrl = (ScrollableCursor) query.getSingleResult();
Object o = null;
while ((o = scrl.next()) != null) { ... }
scrl.close(); // release the cursor and its underlying result set when done
lapsus63
3

There are several techniques that may need to be used in conjunction with one another to create and manipulate queries for large data-sets where memory is a limitation:

  1. Use setFetchSize(some value, maybe 100+), as the default (via JDBC) is 10. This is mostly about performance and is the single biggest factor here. It can be done in JPA via a query hint made available by the provider (Hibernate, etc.); there does not (for whatever reason) seem to be a standard JPA Query.setFetchSize(int) method. See the sketch after this list.
  2. Do not try to marshal the entire result set for 10K+ records. Several strategies apply: for GUIs, use paging or a framework that does paging. Consider Lucene or commercial search/indexing engines (Endeca, if the company has the money). When sending data somewhere, stream it and flush the buffer every N records to limit how much memory is used; the stream may be flushed to a file, the network, etc. Remember that underneath, JPA uses JDBC, and JDBC keeps the result set on the server, fetching only N rows in a row-set group at a time. This breakdown can be exploited to flush data in groups.
  3. Consider what the use case is. Typically, an application is trying to answer questions. When the answer is to weed through 10K+ rows, then the design should be reviewed. Again, consider indexing engines like Lucene, refine the queries, consider using Bloom filters as contains-check caches to find needles in haystacks without going to the database, etc.
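For instance, with Hibernate as the JPA provider, point 1 can be expressed as a query hint. A minimal sketch, assuming the Hibernate-specific hint key "org.hibernate.fetchSize" (there is no standard JPA equivalent):

Query query = em.createQuery("select wl from WorkLog wl");
// Hibernate applies this hint as setFetchSize() on the underlying JDBC statement
query.setHint("org.hibernate.fetchSize", 100);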
Darrell Teague
1

Raw SQL shouldn't be considered a last resort. It should still be considered an option if you want to keep things "standard" on the JPA tier, but not on the database tier. JPA also has support for native queries where it will still do the mapping to standard entities for you.
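For example, a minimal sketch reusing the question's first SQL statement (em is the EntityManager; each row of a multi-column native query comes back as an Object[] per standard JPA behavior):

@SuppressWarnings("unchecked")
List<Object[]> rows = em.createNativeQuery(
    "select w.name, sum(l.hoursWorked) from Worker w, WorkLog l " +
    "where w.id = l.workerId group by w.name")
    .getResultList();
for (Object[] row : rows) {
    String name = (String) row[0];
    Number total = (Number) row[1]; // the sum's Java type varies by driver
}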

However, if you have a large result set that cannot be processed in the database, then you really should just use plain JDBC as JPA (standard) does not support streaming of large sets of data.

It will be harder to port your application across different application servers if you use JPA implementation-specific constructs, since the JPA engine is embedded in the application server and you may not have control over which JPA provider is being used.
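For the plain JDBC route, a minimal sketch (dataSource is assumed to be available; the column names follow the question's schema):

try (Connection con = dataSource.getConnection();
     PreparedStatement ps = con.prepareStatement(
         "select workerId, hoursWorked from WorkLog")) {
    ps.setFetchSize(100); // keep only a small window of rows in client memory
    try (ResultSet rs = ps.executeQuery()) {
        while (rs.next()) {
            long workerId = rs.getLong("workerId");
            long hours = rs.getLong("hoursWorked");
            // aggregate or stream each row out instead of collecting it in a List
        }
    }
}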

Archimedes Trajano
  • this. I found that executing a connection query manually (`session.doWork` or such) is in fact the fastest you can get – phil294 Feb 17 '17 at 14:51
  • the standard EntityManager has no `doWork` operation. – Archimedes Trajano Feb 17 '17 at 16:20
  • yes, that's why I wrote `session` which you can get via `entityManager.unwrap(Session.class);`. Idk if that is bad programming style though. I guess one could also write a `sessionFactory` Bean – phil294 Feb 17 '17 at 17:04
  • That would likely be hibernate specific. There's no Session class for JPA. For the benefit of others can you update your comment to state the full package of the session class? – Archimedes Trajano Feb 17 '17 at 17:06
  • I cannot update the comment anymore, so: Idea says: `org.hibernate.Session`. – phil294 Feb 17 '17 at 17:27
  • JDBC did a lot of things right but... respectfully - it is not good for large result sets. While the client-side tweaking of setFetchSize(int) is at the core of making this functional in terms of controlling client-side memory - it does not inherently solve the problem. That is, when one is looking for a (few) records in hundreds of thousands or millions - full text search engines provide superior performance and use-case support. – Darrell Teague Jun 02 '21 at 18:44
0

I'm using something like this and it works very fast. I also hate to use native SQL as our application should work on any database.

The following results in very optimized SQL and returns a list of records which are maps.

String hql = "select distinct " +
            "t.uuid as uuid, t.title as title, t.code as code, t.date as date, t.dueDate as dueDate, " +
            "t.startDate as startDate, t.endDate as endDate, t.constraintDate as constraintDate, t.closureDate as closureDate, t.creationDate as creationDate, " +
            "sc.category as category, sp.priority as priority, sd.difficulty as difficulty, t.progress as progress, st.type as type, " +
            "ss.status as status, ss.color as rowColor, (p.rKey || ' ' || p.name) as project, ps.status as projectstatus, (r.code || ' ' || r.title) as requirement, " +
            "t.estimate as estimate, w.title as workgroup, o.name || ' ' || o.surname as owner, " +
            "ROUND(sum(COALESCE(a.duration, 0)) * 100 / case when ((COALESCE(t.estimate, 0) * COALESCE(t.progress, 0)) = 0) then 1 else (COALESCE(t.estimate, 0) * COALESCE(t.progress, 0)) end, 2) as factor " +
            "from " + Task.class.getName() + " t " +
            "left join t.category sc " +
            "left join t.priority sp " +
            "left join t.difficulty sd " +
            "left join t.taskType st " +
            "left join t.status ss " +
            "left join t.project p " +
            "left join t.owner o " +
            "left join t.workgroup w " +
            "left join p.status ps " +
            "left join t.requirement r " +
            "left join p.status sps " +
            "left join t.iterationTasks it " +
            "left join t.taskActivities a " +
            "left join it.iteration i " +
            "where sps.active = true and " +
            "ss.done = false and " +
            "(i.uuid <> :iterationUuid or it.uuid is null) " + filterHql +
            "group by t.uuid, t.title, t.code, t.date, t.dueDate, " +
            "t.startDate, t.endDate, t.constraintDate, t.closureDate, t.creationDate, " +
            "sc.category, sp.priority, sd.difficulty, t.progress, st.type, " +
            "ss.status, ss.color, p.rKey, p.name, ps.status, r.code, r.title, " +
            "t.estimate, w.title, o.name, o.surname " + sortHql;

    if (logger.isDebugEnabled()) {
        logger.debug("Executing hql: " + hql );
    }

    Query query =  hibernateTemplate.getSessionFactory().getCurrentSession().getSession(EntityMode.MAP).createQuery(hql);
    for(String key: filterValues.keySet()) {
        Object valueSet = filterValues.get(key);

        if (logger.isDebugEnabled()) {
            logger.debug("Setting query parameter for " + key );
        }

        if (valueSet instanceof java.util.Collection<?>) {
            query.setParameterList(key, (Collection)filterValues.get(key));
        } else {
            query.setParameter(key, filterValues.get(key));
        }
    }       
    query.setString("iterationUuid", iteration.getUuid());
    query.setResultTransformer(Transformers.ALIAS_TO_ENTITY_MAP);

    if (logger.isDebugEnabled()) {
        logger.debug("Query building complete.");
        logger.debug("SQL: " + query.getQueryString());
    }

    return query.list();
Bojan Kraut
  • Optimized? Can you explain this? – nalply Oct 05 '12 at 18:37
  • Depending on the resulting list size, this may crash and burn with a large data set. Have found that the general pattern of storing RI data in a database, then using a search engine (Lucene et al) on top of that is superior. Provides full-text search pattern support, superior performance, built-in paging without overburdening client memory requirements, etc. In short, almost never just return (some) Collection in response to (arbitrary) user queries in that it may be thousands (or millions?) of records in size. – Darrell Teague Jun 02 '21 at 18:48
0

I agree that doing the calculation on the database server is your best option in the particular case you mentioned. HQL and JPAQL can handle both of those queries:

1)

select w, sum(wl.hoursWorked) 
from Worker w, WorkLog wl
where w.id = wl.workerId 
group by w

or, if the association is mapped:

select w, sum(wl.hoursWorked) 
from Worker w join w.workLogs wl
group by w

both of which return a List<Object[]>, where each Object[] contains a Worker and a Long. Or you could also use "dynamic instantiation" queries to wrap that up, for example:

select new WorkerTotal( w, sum(wl.hoursWorked) )
from Worker w join w.workLogs wl
group by w

or (depending on need) probably even just:

select new WorkerTotal( w.id, w.name, sum(wl.hoursWorked) )
from Worker w join w.workLogs wl
group by w.id, w.name

WorkerTotal is just a plain class. It must have matching constructor(s).
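For example, a minimal sketch of such a class matching the last query above (the field names are my assumption; only the constructor signature has to match the select clause):

public class WorkerTotal {
    private final Long workerId;
    private final String name;
    private final Long totalHours; // HQL sum() over an integral property yields a Long

    // matches: select new WorkerTotal( w.id, w.name, sum(wl.hoursWorked) )
    public WorkerTotal(Long workerId, String name, Long totalHours) {
        this.workerId = workerId;
        this.name = name;
        this.totalHours = totalHours;
    }
}

Note that in the query the class name generally needs to be fully qualified, since non-entity classes are not auto-imported by Hibernate.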

2)

select w, new Period( wl.start, wl.hoursWorked )
from Worker w join w.workLogs wl

this will return a result for each row in the WorkLog table... The new Period(...) bit is called "dynamic instantiation" and is used to wrap tuples from the result into objects (for easier consumption).
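A matching Period class could look like this (a sketch; the field types are guesses based on the question's wl.start and wl.hoursWorked):

import java.util.Date;

public class Period {
    private final Date start;
    private final Long hoursWorked;

    // matches: new Period( wl.start, wl.hoursWorked )
    public Period(Date start, Long hoursWorked) {
        this.start = start;
        this.hoursWorked = hoursWorked;
    }
}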

For manipulation and general usage, I recommend StatelessSession as Pascal points out.

Steve Ebersole