127

Let's say I have a table with millions of rows. Using JPA, what's the proper way to iterate over a query against that table, such that I don't end up with an in-memory List containing millions of objects?

For example, I suspect that the following will blow up if the table is large:

List<Model> models = entityManager().createQuery("from Model m", Model.class).getResultList();

for (Model model : models)
{
     System.out.println(model.getId());
}

Is pagination (looping and manually updating setFirstResult()/setMaxResults()) really the best solution?

Edit: the primary use-case I'm targeting is a kind of batch job. It's fine if it takes a long time to run. There is no web client involved; I just need to "do something" for each row, one (or some small N) at a time. I'm just trying to avoid having them all in memory at the same time.

George Armhold

19 Answers

58

Page 537 of Java Persistence with Hibernate gives a solution using ScrollableResults, but alas it's only for Hibernate.

So it seems that using setFirstResult/setMaxResults and manual iteration really is necessary. Here's my solution using JPA:

private List<Model> getAllModelsIterable(int offset, int max)
{
    return entityManager.createQuery("from Model m", Model.class)
            .setFirstResult(offset)
            .setMaxResults(max)
            .getResultList();
}

then, use it like this:

private void iterateAll()
{
    int offset = 0;

    List<Model> models;
    while ((models = getAllModelsIterable(offset, 100)).size() > 0)
    {
        entityManager.getTransaction().begin();
        for (Model model : models)
        {
            log.info("do something with model: " + model.getId());
        }

        entityManager.flush();
        entityManager.clear();
        entityManager.getTransaction().commit();
        offset += models.size();
    }
}
George Armhold
  • I think the example is not safe if there are new inserts during the batch process. The user must order by a column where it is certain that newly inserted data will appear at the end of the result list. – Balazs Zsoldos Nov 15 '12 at 15:22
  • When the current page is the last one and has fewer than 100 elements, checking `size() == 100` instead will skip one extra query that returns an empty list. – cdalxndr May 12 '20 at 18:52
  • No longer only for Hibernate: https://stackoverflow.com/a/69486331/574263 – mjaggard Jun 08 '22 at 11:17
44

I tried the answers presented here, but JBoss 5.1 + MySQL Connector/J 5.1.15 + Hibernate 3.3.2 didn't work with them. We've just migrated from JBoss 4.x to JBoss 5.1, so we're stuck with it for now, and thus the latest Hibernate we can use is 3.3.2.

Adding a couple of extra parameters did the job, and code like this runs without OOMEs:

StatelessSession session = ((Session) entityManager.getDelegate()).getSessionFactory().openStatelessSession();

Query query = session.createQuery("SELECT a FROM Address a WHERE .... ORDER BY a.id");
query.setFetchSize(Integer.valueOf(1000));
query.setReadOnly(true);
query.setLockMode("a", LockMode.NONE);
ScrollableResults results = query.scroll(ScrollMode.FORWARD_ONLY);
while (results.next()) {
    Address addr = (Address) results.get(0);
    // Do stuff
}
results.close();
session.close();

The crucial lines are the query parameters set between createQuery and scroll. Without them, the scroll call tries to load everything into memory and either never finishes or runs into an OutOfMemoryError.

Zds
    Hi Zds, your use case of scanning millions of rows is certainly common for me, and THANK YOU for posting the final code. In my case I'm shoving records into Solr, to index them for fulltext search. And, due to business rules I won't go into, I need to go via Hibernate, vs. just using JDBC or Solr's built-in modules. – Mark Bennett Feb 02 '12 at 00:12
  • Happy to help :-). We are also dealing with large data sets, in this case allowing user to query all street names within the same city/county, or sometimes even state, so creating indices requires reading lot of data. – Zds Apr 10 '12 at 13:33
  • Appears with MySQL you really do have to go through all those hoops: http://stackoverflow.com/a/20900045/32453 (other DB's might be less stringent I'd imagine...) – rogerdpack Jul 01 '16 at 21:31
32

You can't really do this in straight JPA; however, Hibernate has support for stateless sessions and scrollable result sets.

We routinely process billions of rows with its help.

Here is a link to documentation: http://docs.jboss.org/hibernate/core/3.3/reference/en/html/batch.html#batch-statelesssession

Cyberax
19

To be honest, I would suggest leaving JPA and sticking with JDBC (but certainly using a support class like JdbcTemplate). JPA (and other ORM providers/specifications) is not designed to operate on many objects within one transaction, as it assumes everything loaded should stay in the first-level cache (hence the need for clear() in JPA).

I also recommend a more low-level solution because the overhead of ORM (reflection is only the tip of the iceberg) might be so significant that iterating over a plain ResultSet, even with some lightweight support like the mentioned JdbcTemplate, will be much faster.

JPA is simply not designed to perform operations on large numbers of entities. You might play with flush()/clear() to avoid OutOfMemoryError, but consider this once again: you gain very little while paying the price of huge resource consumption.

Tomasz Nurkiewicz
  • The advantage of JPA is being not just database agnostic but the possibility of not even using a traditional database (NoSQL). Its not to hard to do flush/clear every now and then and usually batch operations are done infrequently. – Adam Gent Feb 21 '11 at 15:39
  • Hi Tomasz. I have plenty of reasons to complain about JPA/Hibernate, but respectfully, I really doubt that they are "not designed to operate on many objects". I suspect that I just need to learn the proper pattern for this use-case. – George Armhold Feb 21 '11 at 16:04
  • Well, I can think only of two patterns: pagination (mentioned several times) and `flush()`/`clear()`. The first is IMHO not designed for the purposes of batch processing, while using a sequence of flush()/clear() smells like *leaky abstraction*. – Tomasz Nurkiewicz Feb 21 '11 at 17:20
  • Yup, it was a combination of pagination and the flush/clear as you mentioned. Thanks! – George Armhold Feb 22 '11 at 14:04
7

If you use EclipseLink, I use this method to get the result as an Iterable:

private static <T> Iterable<T> getResult(TypedQuery<T> query)
{
  // EclipseLink
  if (query instanceof JpaQuery) {
    JpaQuery<T> jQuery = (JpaQuery<T>) query;
    jQuery.setHint(QueryHints.RESULT_SET_TYPE, ResultSetType.ForwardOnly)
          .setHint(QueryHints.SCROLLABLE_CURSOR, true);

    final Cursor cursor = jQuery.getResultCursor();
    return new Iterable<T>()
    {
      @SuppressWarnings("unchecked")
      @Override
      public Iterator<T> iterator()
      {
        return cursor;
      }
    };
  }
  return query.getResultList();
}

And the corresponding close method:

static void closeCursor(Iterable<?> list)
{
  Iterator<?> it = list.iterator();
  if (it instanceof Cursor)
  {
    ((Cursor) it).close();
  }
}
  • I gave your code a try but still get OOM - it appears all T objects (and all joined table objects referred from T) are never GC. Profiling shows them being referred to from "table" in org.eclipse.persistence.internal.sessions.RepeatableWriteUnitOfWork along with org.eclipse.persistence.internal.identitymaps.CacheKey. I looked into cache and my settings are all default (Disable Selective, Weak with Soft Subcache, Cache Size 100, Drop Invalidate). I will look into disabling sessions and see if it helps. BTW I simply iterate over return cursor using "for (T o: results)". – Edi Bice Jul 08 '16 at 15:52
5

It depends on the kind of operation you have to do. Why are you looping over millions of rows? Are you updating something in batch mode? Are you going to display all the records to a client? Are you computing some statistics on the retrieved entities?

If you are going to display a million records to the client, please reconsider your user interface. In this case, the appropriate solution is paginating your results using setFirstResult() and setMaxResults().

If you have launched an update of a large number of records, you'd better keep the update simple and use Query.executeUpdate(). Optionally, you can execute the update in asynchronous mode using a Message-Driven Bean or a Work Manager.

If you are computing some statistics on the retrieved entities, you can take advantage of the grouping functions defined by the JPA specification.

For any other case, please be more specific :)

frm
  • Quite simply, I need to do something "for each" row. Surely this is a common use case. In the specific case I am working on now, I need to query an external web service that is totally outside of my database, using an id (the PK) from each row. The results are not displayed back to any client web browser, so there is no user interface to speak of. It's a batch job, in other words. – George Armhold Feb 21 '11 at 15:33
  • If you "need" to print the id for each row, there is no other way than to get each row, get the id, and print it. The best solution depends on what you need to do. – Dainius Feb 21 '11 at 16:05
  • @Caffeine Coma, if you only need the id of each row then the biggest improvement would probably come from only fetching that column, as `SELECT m.id FROM Model m` and then iterating over a List. – Jörn Horstmann Feb 21 '11 at 16:10
    @Jörn Horstmann- if there are millions of rows, will it really matter? My point is that an ArrayList with millions of objects (however small) is not going to be good for the JVM heap. – George Armhold Feb 21 '11 at 16:13
  • @Dainius: my question is really: "how can I iterate over each row, without having the whole ArrayList in-memory?" In other words, I'd like an interface for pulling N at a time, where N is significantly smaller than 1 million. :-) – George Armhold Feb 21 '11 at 16:16
  • In this case, I think the only optimization should be to fetch only the id, as Jörn Horstmann said AND using pagination to query one chunk of data at a time. With these improvements, you should minimize the use of memory (work on data chunks) and the time spent on converting objects (fetch only an Integer instead of an entire object). – frm Feb 21 '11 at 16:19
  • @frm: this will will still take way more time than just using straight JDBC and a server side cursor. –  Feb 21 '11 at 22:24
  • or maybe will help Query.iterate() http://www.javalobby.org/articles/hibernate-query-101/ – Dainius Feb 22 '11 at 08:42
5

There is no "proper" way to do this; it isn't what JPA, JDO, or any other ORM is intended to do. Straight JDBC will be your best alternative, as you can configure it to bring back a small number of rows at a time and release them as they are used; that is why server-side cursors exist.

ORM tools are not designed for bulk processing. They are designed to let you manipulate objects and to make the RDBMS the data is stored in as transparent as possible; most fail at the transparent part, at least to some degree. At this scale, there is no way to process hundreds of thousands of rows (objects), much less millions, with any ORM and have it execute in any reasonable amount of time, because of the object-instantiation overhead, plain and simple.

Use the appropriate tool. Straight JDBC and stored procedures definitely have a place in 2011, especially for what they do better than these ORM frameworks.

Pulling a million of anything, even into a simple List<Integer>, is not going to be very efficient regardless of how you do it. The correct way to do what you are asking is a simple SELECT id FROM table, with the cursor set to SERVER SIDE (vendor dependent) and FORWARD_ONLY, READ_ONLY, and then iterate over that.

If you are really pulling millions of ids to process by calling some web service for each one, you will have to do some concurrent processing as well for this to run in any reasonable amount of time. Pulling with a JDBC cursor, placing a few ids at a time in a ConcurrentLinkedQueue, and having a small pool of threads (# CPUs/cores + 1) pull from it and process them is the only way to complete your task on a machine with any "normal" amount of RAM, given you are already running out of memory.

See this answer as well.
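As a sketch of the queue-plus-thread-pool idea above, here is a self-contained plain-Java version: a bounded queue and a small worker pool, with an in-memory id list standing in for the JDBC cursor and a set insertion standing in for the per-id web service call. All names are illustrative assumptions, not part of any library API.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ConcurrentIdProcessor {

    // Sentinel marking end of input; assumes no real id is Long.MIN_VALUE.
    private static final Long POISON = Long.MIN_VALUE;

    /**
     * Drains ids through a small bounded queue into a worker pool, so only
     * a handful of ids are in flight at any moment. The worker-loop body is
     * where the real per-row work (e.g. a web service call) would go.
     */
    static Set<Long> processAll(List<Long> ids, int workers) throws InterruptedException {
        BlockingQueue<Long> queue = new ArrayBlockingQueue<>(100); // bounded: keeps memory flat
        Set<Long> processed = ConcurrentHashMap.newKeySet();
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        for (int i = 0; i < workers; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        Long id = queue.take();
                        if (id.equals(POISON)) break; // one poison pill per worker
                        processed.add(id);            // stand-in for the real per-id processing
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        for (Long id : ids) queue.put(id);            // producer blocks when the queue is full
        for (int i = 0; i < workers; i++) queue.put(POISON);
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        List<Long> ids = new java.util.ArrayList<>();
        for (long i = 1; i <= 10_000; i++) ids.add(i);
        // prints 10000
        System.out.println(processAll(ids, Runtime.getRuntime().availableProcessors() + 1).size());
    }
}
```

In real code the producer loop would read from a forward-only ResultSet instead of a List, so neither the ids nor the entities ever exist in memory all at once.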

    So you're saying that no company ever needs to visit every row of their users table? Their programmers just throw Hibernate out the window when it comes time to do this? "*there is no way to process hundreds of thousands of rows*"- in my question I pointed out setFirstResult/setMaxResult, so clearly there *is* a way. I'm asking if there is a better one. – George Armhold Feb 21 '11 at 16:48
  • *"Pulling a million of anything, even into a simple List is not going to be very efficient regardless of how you do it."* That is exactly my point. I'm asking how *not* to create the giant list, but rather to iterate over a result set. – George Armhold Feb 21 '11 at 16:51
  • Use a simple straight JDBC select statement with a FORWARD_ONLY READ_ONLY with a SERVER_SIDE cursor as I suggested in my answer. How to make JDBC use a SERVER_SIDE cursor is database driver dependent. –  Feb 21 '11 at 20:37
  • Visiting every "row" in a table is just what JDBC is designed to do. JPA is designed to create one or more Objects from each row in a table, completely different semantics and uses. JPA is a dump truck for hauling around objects, you want performance you use a Sports Car. So yes, developers with large scale data manipulation "throw out the ORM and hit the metal with straight JDBC". Experience has taught me to use the right tool for the right job, and experience teaches you what those tools are. –  Feb 22 '11 at 16:28
    I fully agree with the answer. The best solution is dependent on the problem. If the problem is loading a few entities easily JPA is good. If the problem is using huge amounts of data efficiently direct JDBC is better. – extraneon May 05 '11 at 18:38
    Scanning through millions of records is common for a number of reasons, for example indexing them into a search engine. And although I agree that JDBC is normally a more direct route, you sometimes walk into a project that already has very complex business logic bundled up in a Hibernate layer. If you bypass it and go to JDBC, you bypass the business logic, which is sometimes non-trivial to re-implement and maintain. When folks post questions about atypical use cases they often know it's a bit weird, but may be inheriting something vs. building from scratch, and maybe can't disclose details. – Mark Bennett Feb 02 '12 at 00:20
  • Bypassing the ORM can be hazardous if you use caching. If you mainly use DB via EJB3, bypassing it can cause lot of unnecessary headache due to caching. – Zds Feb 10 '12 at 11:26
    I think there is a "proper" way now. https://stackoverflow.com/a/69486331/574263 – mjaggard Jun 08 '22 at 11:17
4

You can use another "trick": load only a collection of identifiers of the entities you're interested in. Say the identifier is of type long (8 bytes); then a list of 10^6 such identifiers takes around 8 MB. If it is a batch process (one instance at a time), it's bearable. Then just iterate and do the job.

One other remark: you should do this in chunks anyway, especially if you modify records, otherwise the rollback segment in the database will grow.

When it comes to the setFirstResult/setMaxResults strategy, it will be VERY, VERY slow for results far from the top.

Also take into consideration that the database is probably operating in read-committed isolation, so to avoid phantom reads, load the identifiers first and then load the entities one by one (or 10 by 10, or whatever).
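A minimal plain-Java sketch of this ids-first approach, with `entityLoader` standing in for a JPQL `WHERE e.id IN :ids` query (the method and parameter names are illustrative, not from any framework):

```java
import java.util.List;
import java.util.function.Function;

public class ChunkedByIds {

    /**
     * Id-first iteration: the caller supplies the full list of cheap ids
     * and an entityLoader that stands in for a "WHERE e.id IN :ids" query.
     * Only one chunk of heavyweight entities is alive at a time; with a
     * real EntityManager you would also call clear() after each chunk.
     */
    static <E> int processInChunks(List<Long> ids, int chunkSize,
                                   Function<List<Long>, List<E>> entityLoader) {
        int processed = 0;
        for (int from = 0; from < ids.size(); from += chunkSize) {
            List<Long> chunk = ids.subList(from, Math.min(from + chunkSize, ids.size()));
            for (E entity : entityLoader.apply(chunk)) {
                processed++; // "do the job" with one entity here
            }
        }
        return processed;
    }
}
```

The same loop works whether the loader fetches 1, 10, or 100 entities per round trip; only the chunk size changes the memory/latency trade-off.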

Marcin Cinik
  • Hi @Marcin, can you or anyone else provide a link to example code applying this chunked and id-first stepwise approach, preferably using Java8 streams? – krevelen Sep 14 '16 at 11:17
3

Finally, the answer to what you want arrived in JPA 2.2, and for Hibernate (at least in v5.4.30) it uses the ScrollableResults implementation mentioned in a previous answer.

Your code can now look like this:

entityManager().createQuery("from Model m", Model.class)
               .getResultStream()
               .forEach(model -> System.out.println(model.getId()));
mjaggard
2

I was surprised to see that the use of stored procedures was not more prominent in the answers here. In the past when I've had to do something like this, I create a stored procedure that processes data in small chunks, then sleeps for a bit, then continues. The reason for the sleeping is to not overwhelm the database which is presumably also being used for more real time types of queries, such as being connected to a web site. If there is no one else using the database, then you can leave out the sleep. If you need to ensure that you process each record once and only once, then you will need to create an additional table (or field) to store which records you have processed in order to be resilient across restarts.

The performance savings here are significant, possibly orders of magnitude faster than anything you could do in JPA/Hibernate/AppServer land, and your database server will most likely have its own server side cursor type of mechanism for processing large result sets efficiently. The performance savings come from not having to ship the data from the database server to the application server, where you process the data, and then ship it back.

There are some significant downsides to using stored procedures which may completely rule this out for you, but if you've got that skill in your personal toolbox and can use it in this kind of situation, you can knock out these kinds of things fairly quickly.
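A rough plain-Java sketch of the same chunk-sleep-resume pattern, with an in-memory set standing in for the extra processed-records table (names and crash semantics are illustrative assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class ResumableBatch {

    /**
     * Chunk-sleep-resume in plain Java. The 'done' set stands in for the
     * extra table/column that records which rows were already processed,
     * so a restarted run skips them (each row handled exactly once).
     */
    static int run(List<Long> allIds, Set<Long> done, int chunkSize, long sleepMillis)
            throws InterruptedException {
        List<Long> pending = new ArrayList<>();
        for (Long id : allIds) {
            if (!done.contains(id)) pending.add(id); // resume: skip rows already marked
        }
        int processedNow = 0;
        for (int i = 0; i < pending.size(); i += chunkSize) {
            List<Long> chunk = pending.subList(i, Math.min(i + chunkSize, pending.size()));
            for (Long id : chunk) {
                done.add(id); // mark as processed; in SQL this is an UPDATE in the same transaction
                processedNow++;
            }
            if (sleepMillis > 0) Thread.sleep(sleepMillis); // don't starve live traffic
        }
        return processedNow;
    }
}
```

In a stored procedure the same structure appears as a loop over `SELECT ... WHERE processed = 0 LIMIT n`, an `UPDATE` marking the chunk, a `COMMIT`, and a short sleep.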

Danger
    -2 downvotes - would the next downvoter please defend your downvote? – Danger Nov 27 '13 at 00:23
    I thought the same thing while reading these. The question indicates a high volume batch job with no UI . Assuming that you do not need app server specific resources, why use an app server at all? Stored procedure would be much more efficient. – jdessey Jul 08 '14 at 19:03
  • @jdessey Depending on the situation, let's say we have an import facility where on import it should do something with some other part of the system e.g. add rows to another table based on some business rules that have been coded already as an EJB. Then running in an app server would make more sense, unless you can get the EJB to run in an embedded mode. – Archimedes Trajano Sep 30 '15 at 22:26
1

To expand on @Tomasz Nurkiewicz's answer: you have access to the DataSource, which in turn can provide you with a connection.

@Resource(name = "myDataSource",
    lookup = "java:comp/DefaultDataSource")
private DataSource myDataSource;

In your code you have

try (Connection connection = myDataSource.getConnection()) {
    // raw jdbc operations
}

This will allow you to bypass JPA for specific large batch operations like import/export, while you still have access to the entity manager for other JPA operations if you need it.

Archimedes Trajano
1

Here's a simple, straight JPA example (in Kotlin) that shows how you can paginate over an arbitrarily large result set, reading chunks of 100 items at a time, without using a cursor (each cursor consumes resources on the database). It uses keyset pagination.

See https://use-the-index-luke.com/no-offset for the concept of keyset pagination, and https://www.citusdata.com/blog/2016/03/30/five-ways-to-paginate/ for a comparison of different ways to paginate along with their drawbacks.

/*
create table my_table(
  id int primary key, -- index will be created
  my_column varchar
)
*/

fun keysetPaginationExample() {
    var lastId = Integer.MIN_VALUE
    do {
        val someItems =
            myRepository.findTop100ByMyTableIdAfterOrderByMyTableId(lastId)

        if (someItems.isEmpty()) break

        lastId = someItems.last().myTableId

        for (item in someItems) {
            process(item)
        }
    } while (true)
}
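The keyset loop above can be simulated in plain Java against an in-memory sorted map; the map and `scanAll` stand in for the table and the repository query, and all names are illustrative:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableMap;
import java.util.stream.Collectors;

public class KeysetPagination {

    /**
     * Keyset pagination over an id-sorted source: each round asks for the
     * first pageSize ids strictly greater than the last id seen, so the
     * database never has to skip OFFSET rows. The NavigableMap stands in
     * for the table; keys are assumed to be > Integer.MIN_VALUE.
     */
    static List<Integer> scanAll(NavigableMap<Integer, String> table, int pageSize) {
        List<Integer> visited = new ArrayList<>();
        int lastId = Integer.MIN_VALUE;
        while (true) {
            // "SELECT id FROM my_table WHERE id > :lastId ORDER BY id LIMIT :pageSize"
            List<Integer> page = table.tailMap(lastId, false).keySet().stream()
                    .limit(pageSize)
                    .collect(Collectors.toList());
            if (page.isEmpty()) break;
            lastId = page.get(page.size() - 1);
            visited.addAll(page); // process(table.get(id)) would go here, per id
        }
        return visited;
    }
}
```

Because each query is bounded by the primary-key index, the cost per page stays constant no matter how deep into the result set you are, unlike setFirstResult-style offset pagination.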
Elifarley
1

An example with JPA and a native query, fetching fetchSize elements at a time using offsets:

public List<X> getXByFetching(int fetchSize) {
    int totalX = getTotalRows(Entity);
    List<X> result = new ArrayList<>();
    for (int offset = 0; offset < totalX; offset = offset + fetchSize) {
        EntityManager entityManager = getEntityManager();
        String sql = getSqlSelect(Entity) + " OFFSET " + offset + " ROWS";
        Query query = entityManager.createNativeQuery(sql, X.class);
        query.setMaxResults(fetchSize);
        result.addAll(query.getResultList());
        entityManager.flush();
        entityManager.clear();
    }
    return result;
}
harryssuperman
0

With hibernate there are 4 different ways to achieve what you want. Each has design tradeoffs, limitations, and consequences. I suggest exploring each and deciding which is right for your situation.

  1. Use stateless session with scroll()
  2. Use session.clear() after every iteration. When other entities need to be attached, load them in a separate session. Effectively, the first session emulates a stateless session but retains all the features of a stateful session until the objects are detached.
  3. Use iterate() or list() but get only ids in the first query, then in a separate session in each iteration, do session.load and close the session at the end of the iteration.
  4. Use Query.iterate() with EntityManager.detach(), a.k.a. Session.evict().
Larry Chu
0

I have wondered this myself. It seems to matter:

  • how big your dataset is (rows)
  • what JPA implementation you are using
  • what kind of processing you are doing for each row.

I have written an Iterator to make it easy to swap out both approaches (findAll vs findEntries).

I recommend you try both.

Long count = entityManager().createQuery("select count(o) from Model o", Long.class).getSingleResult();
ChunkIterator<Model> it1 = new ChunkIterator<Model>(count, 2) {

    @Override
    public Iterator<Model> getChunk(long index, long chunkSize) {
        //Do your setFirst and setMax here and return an iterator.
    }

};

Iterator<Model> it2 = entityManager().createQuery("from Model m", Model.class).getResultList().iterator();


public static abstract class ChunkIterator<T> 
    extends AbstractIterator<T> implements Iterable<T>{
    private Iterator<T> chunk;
    private Long count;
    private long index = 0;
    private long chunkSize = 100;

    public ChunkIterator(Long count, long chunkSize) {
        super();
        this.count = count;
        this.chunkSize = chunkSize;
    }

    public abstract Iterator<T> getChunk(long index, long chunkSize);

    @Override
    public Iterator<T> iterator() {
        return this;
    }

    @Override
    protected T computeNext() {
        if (count == 0) return endOfData();
        if (chunk != null && chunk.hasNext() == false && index >= count) 
            return endOfData();
        if (chunk == null || chunk.hasNext() == false) {
            chunk = getChunk(index, chunkSize);
            index += chunkSize;
        }
        if (chunk == null || chunk.hasNext() == false) 
            return endOfData();
        return chunk.next();
    }

}

I ended up not using my chunk iterator (so it might not be that well tested). By the way, you will need Google Collections (Guava) if you want to use it.

Adam Gent
  • Regarding "what kind of processing you're doing for each row"- if the # of rows is in the millions, I suspect that even a simple object with only an id column is going to cause problems. I too thought about writing my own Iterator that wrapped setFirstResult/setMaxResult, but I figured that this must be a common (and hopefully solved!) issue. – George Armhold Feb 21 '11 at 15:27
  • @Caffeine Coma I posted my Iterator you could probably do some more JPA adapting to it. Tell me if it helps. I ended up not using (did a findAll). – Adam Gent Feb 21 '11 at 15:35
0

Use the pagination concept for retrieving the result.

Dead Programmer
    Pagination is very good for GUI's. But for processing huge amounts of data the ScrollableResultSet was invented a long time ago. It's just not in JPA. – extraneon May 05 '11 at 18:36
0

My two cents, in addition to the helpful accepted answer: we can use Spring Data JPA to handle the pagination. Repository example:

public interface ModelRepository extends JpaRepository<Model, Long> {

    //feel free to change the order by criteria
    List<Model> findAllByOrderByUpdatedAsc(Pageable pageable);

}

And in your service, just do as initially suggested:

private void iterateAll()
{
    int page = 0;

    List<Model> models;
    // note: PageRequest.of takes a page number (not a row offset) and a page size
    while ((models = modelRepository.findAllByOrderByUpdatedAsc(PageRequest.of(page, 100))).size() > 0)
    {
        for (Model model : models)
        {
            log.info("do something with model: " + model.getId());
        }

        page++;
    }
}

Note that OrderByUpdatedAsc is likely preferable to OrderByUpdatedDesc if your table is being updated in parallel, although it is not necessarily immune to that sort of issue.

Additionally,

Page 537 of Java Persistence with Hibernate gives a solution using ScrollableResults, but alas it's only for Hibernate.

Beware of potential memory leaks when using scrollable results; I've had problems like these -> Spring/Hibernate connection leak with ScrollableResults

João Matos
0

In the Spring Data JPA documentation (version 3.3) there is a section about scrolling:

Scrolling is a more fine-grained approach to iterate through larger results set chunks. Scrolling consists of a stable sort, a scroll type (Offset- or Keyset-based scrolling) and result limiting. You can define simple sorting expressions by using property names and define static result limiting using the Top or First keyword through query derivation. You can concatenate expressions to collect multiple criteria into one expression.

Scroll queries return a Window that allows obtaining the scroll position to resume and obtain the next Window until your application has consumed the entire query result. Similar to consuming a Java Iterator<List<…>> by obtaining the next batch of results, query result scrolling lets you access a ScrollPosition through Window.positionAt(…).

As sample:

interface UserRepository extends Repository<User, Long> {

  Window<User> findFirst10ByLastnameOrderByFirstname(String lastname, OffsetScrollPosition position);

}

//Somewhere in your service
WindowIterator<User> users = WindowIterator.of( position -> repository.findFirst10ByLastnameOrderByFirstname("Doe", position))
                              .startingAt(OffsetScrollPosition.initial());

while (users.hasNext()) {
  User u = users.next();
  // consume the user
}
Alireza Fattahi
-2

Most blogs/solutions deal with MySQL or PostgreSQL rather than Oracle when trying to read/retrieve a large data set and process it with locking/updates from multiple servers or multiple threads using Spring JPA.

My version is the latest: v2.7.8.

As JPA repositories don't support scrolling the data by default, I was using ROWNUM (i.e., ROWNUM <= 1000) in the queries for processing.

Most people know the consequences of using ROWNUM from an Oracle standpoint.

However, I will explain in short:

Whenever a value is assigned to ROWNUM and the SQL query is fired, mostly the same set is retrieved. In this case, when I lock the rows with a Spring JPA native query, other servers/threads end up picking 0 records.

It was a nightmare: terrible sleepless nights over performance and resource utilisation.

I overcame this with an implementation on top of the JPA repository, which saved me in all aspects.

Here is the solution for all database users using Spring JPA with locking (FOR UPDATE SKIP LOCKED):

1. Amend the JPA Repository interface:

@Repository
public interface SampleDataRepo extends JpaRepository<SampleObj, Id> , SampleDataRepository{
}

2. Create the new interface declared above (a custom fragment; note it does not itself extend JpaRepository):

public interface SampleDataRepository {

// Your Method
List<SampleObj> retrieveDataFromSample(
            @Param("someColumn") String someColumn);

}

3. Implement the above Interface:

public class SampleDataRepositoryImpl implements SampleDataRepository {

    @PersistenceContext
    private EntityManager entityManager;

    // Your Method
    @Override
    public List<SampleObj> retrieveDataFromSample(
            @Param("someColumn") String someColumn) {

        final String findDataBySomeColumnQuery = "SELECT r.ID FROM SAMPLE_OBJECT_TABLE r WHERE r.SOME_COLUMN = :someColumn ORDER BY ID ASC FOR UPDATE SKIP LOCKED";

List<SampleObj> dataList = new ArrayList<>();

ScrollableResults result = entityManager.createNativeQuery(findDataBySomeColumnQuery, SampleObj.class)
        .setParameter("someColumn", someColumn)
                .setHint(QueryHints.HINT_FETCH_SIZE, 1000)
                .unwrap(Query.class).scroll(ScrollMode.FORWARD_ONLY);

        int limit = 1000;
        try {
            while (result.next() && limit > 0) {
                dataList.add((SampleObj) result.get(0));
                limit--;
            }
            return dataList;
        } finally {
            if (result != null) {
                result.close();
            }
        }
  }
}

4. Call the above method from your service class as before:

@Service("SampleMonitor")
@Primary
public class SampleMonitorServiceImpl implements SampleMonitorService {

    @Autowired
    private SampleDataRepo sampleDataRepo;

    @Override
    @Transactional(readOnly = false, isolation = Isolation.READ_COMMITTED, rollbackFor = { Exception.class,
            RuntimeException.class })
    public void monitor(){

.......
.......

List<SampleObj> dataList = sampleDataRepo.retrieveDataFromSample(value);
.......

  }
}


Now there are no heap size/out-of-memory issues, and performance is top-notch with multiple threads/servers.

Happy coding.
The Rocket
  • please edit and format your answer: code as code with proper indentation, descriptive plain text _not_ as code. .. and don't post the exact same answer to multiple questions – kleopatra Apr 14 '23 at 09:10
  • I humbly request to concentrate on the solution rather than the indentation. For the hardwork made to help somebody resulting with not useful mark is not justifiable. Hope you understand my intention. Thank You,. – The Rocket Apr 14 '23 at 09:12
  • @kleopatra Hope you are happy now. Please make it useful for someone by marking positive. Thank You. – The Rocket Apr 14 '23 at 09:23
  • yeah, you did improve it, thanks for starting :) - but still not good enough: no need to quote what you are saying (or did you copy it from somewhere? if so, you should attribute the source), the bullets in the code snippets are .. just text not code ;) – kleopatra Apr 14 '23 at 09:41
  • @kleopatra It my own solution which I found out and implemented in my projects. Hope you respect that. I don't have to paste someone else solutions. – The Rocket Apr 14 '23 at 09:42
  • good - then fix the formatting to not make it appear as code :) – kleopatra Apr 14 '23 at 09:45
  • @kleopatra Sorry for the formatting issue. However, I'm busy. Thanks for your interest. – The Rocket Apr 14 '23 at 09:56