4

I am using spring and hibernate in my project and few day ago I found that Dev environment has crashed due to Java out of heap space exception. After some preliminary analysis using some heap analysis tools and visual vm, I found that the problem is with the one select SQL query. I rewrote the SQL in a different way which solved the memory issue. But now I am not sure why the previous SQL has caused the memory issue. Note: The method is inside a DAO and is called in a while loop with a batch size of 800 until all the data is pulled. Table size is around 20 million rows. For each call, a new hibernate session is created and destroyed.

Previous SQL:

@Override
public List<Book> getbookByJournalId(UnitOfWork uow,
        List<Journal> batch) {
    StringBuilder sb = new StringBuilder();
    sb.append("select i from Book i where ( ");
    if (batch == null || batch.size() <= 0)
        sb.append("1=0 )");
    else {
        for (int i = 0; i < batch.size(); i++) {
            if (i > 0)
                sb.append(" OR ");
            sb.append("( i.journalId='" + batch.get(i).journalId() + "')");
        }
        sb.append(")");
        sb.append(
                " and i.isDummy=:isNotDummy and i.statusId !=:BookStatus and i.BookNumber like :book ");
    }

    Query query = uow.getSession().createQuery(sb.toString());
    query.setParameter("isNotDummy", Definitions.BooleanIdentifiers_Char.No);
    query.setParameter("Book", "%" + Definitions.NOBook);
    query.setParameter("BookStatus", Definitions.BookStatusID.CLOSED.getValue());
    List<Book> bookList = (List<Book>) query.getResultList();
    return bookList;
}

Rewritten SQL:

@Override
public List<Book> getbookByJournalId(UnitOfWork uow,
        List<Journal> batch) {
    List<String> bookIds = new ArrayList<>();
    for(Journal J : batch){
        bookIds.add(J.getJournalId());
    }
    StringBuilder sb = new StringBuilder();
    sb.append("select i from Book i where i.journalId in (:bookIds) and i.isDummy=:isNotDummy and i.statusId !=:BookStatus and i.BookNumber like :Book");

    Query query = uow.getSession().createQuery(sb.toString());
    query.setParameter("isNotDummy", Definitions.BooleanIdentifiers_Char.No);
    query.setParameter("Book", "%" + Definitions.NOBook);
    query.setParameter("BookStatus", Definitions.BookStatusID.CLOSED.getValue());
    query.setParameter("specimenNums",specimenNums);
    query.setParameter("bookIds", bookIds);
    List<Book> bookList = (List<Book>) query.getResultList();
    return bookList;
}
  • Where are you setting `"bookIds` in the rewritten SQL? ALso to answer your performance qs, please check https://stackoverflow.com/questions/3074713/in-vs-or-in-the-sql-where-clause – Amit K Bist Nov 21 '17 at 20:20
  • Depending on the size of your `Journal` list you create a massive amount of `String` s. Now you only create a single one which will be reused. – M. Deinum Nov 22 '17 at 06:52
  • Yes, I forgot to set that parameter here in the question, so i have edited it now. – Krishna Bakka Nov 23 '17 at 17:05
  • Hi Deinum, i agree with you but that doesnt lead to out of heap error, since that whole query string is in method scope and it will garbage collected once the method is executed. I think this might be related to hibernate which i am unable to figure out. – Krishna Bakka Nov 23 '17 at 17:11

1 Answers1

0

When you create dynamic SQL statements, you miss out on ability of the database to cache the statement, indexes and even entire tables to optimise your data retrieval. That said, dynamic SQL can still be a practical solution. But you need to be a good citizen on the both the application and database servers, by being very efficient with your memory usage. For a solution that needs to scale to 20 million rows, I recommend using more of a disk-based approach, using as little RAM as possible (i.e. avoiding arrays).

Problems I can see from the first statement are the following:

Up to 800 OR conditions may be added to the first statement for each batch. That makes for a very long SQL statement (not good). This I believe [please correct me if I'm wrong] would need to be cached in JVM heap and then passed to the database. Java may not release this statement from the heap straight away, and garbage collection might be too slow to keep up with your code, increasing the RAM usage. You shouldn't rely on it to clean up after you while your code is running.

If you ran this code in parallel, many sessions on hibernate may risk having many sessions on the database too. I believe you should only use one session for this, unless there is a specific reason. Creating and destroying sessions that you don't need just creates unnecessary traffic on servers and the network. If you are running this code serially, then why drop the session, when you can reuse it for the next batch? You may have a valid reason, but the question must be asked.

In the second statement, creating the bookIds array again uses up RAM in the JVM heap, and the where i.journalId in (:bookIds) part of the SQL will still be lengthy. Not as bad as before, but I think still too long. You would be much better off doing the following:

Create a table on the database, with batchNumber, bookId and perhaps some meta-data, such as flags or timestamps. Join the Book table to your new table using a static statement, and pass in the batchNumber as a new parameter.

create table Batch
(
    id integer primary key,
    batchNumber integer not null,
    bookId integer not null,
    processed_datetime timestamp
);

create unique index Batch_Idx on Batch (batchNumber, bookId);

-- Put this statement into a loop, or use INSERT/SELECT if the data is available in the database
insert into Batch batchNumber values (:batchNumber, :bookId);

-- Updated SQL statement. This is now static. Note that batchNumber needs to be provided as a parameter.
select i
from Book i 
inner join Batch b on b.bookId = i.journalId
where b.batchNumber = :batchNumber
and i.isDummy=:isNotDummy and i.statusId !=:BookStatus and i.BookNumber like :Book;
bruceskyaus
  • 784
  • 4
  • 14
  • + heap will not only keep the whole string in memory, but every intermediate string generated for in iteration as well. – Ankit Mar 06 '19 at 07:06