
Question: How can I process (read in) batches of records 1000 at a time and ensure that only the current batch of 1000 records is in memory? Assume my primary key is called 'ID' and my table is called Customer.

Background: This is not for user pagination, it is for compiling statistics about my table. I have limited memory available, therefore I want to read my records in batches of 1000 records at a time. I am only reading in records, they will not be modified. I have read that StatelessSession is good for this kind of thing and I've heard about people using ScrollableResults.

What I have tried: Currently I am working on a custom made solution where I implemented Iterable and basically did the pagination by using setFirstResult and setMaxResults. This seems to be very slow for me but it allows me to get 1000 records at a time. I would like to know how I can do this more efficiently, perhaps with something like ScrollableResults. I'm not yet sure why my current method is so slow; I'm ordering by ID but ID is the primary key so the table should already be indexed that way.
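Roughly, the batching arithmetic behind that setFirstResult/setMaxResults approach looks like this (a minimal, self-contained sketch; the commented line marks where the Criteria calls would go):

```java
import java.util.ArrayList;
import java.util.List;

public class BatchOffsets {
    // Returns the setFirstResult offset for each batch needed to cover totalRows.
    static List<Integer> offsets(int totalRows, int batchSize) {
        List<Integer> result = new ArrayList<>();
        for (int first = 0; first < totalRows; first += batchSize) {
            // criteria.setFirstResult(first).setMaxResults(batchSize).list()
            result.add(first);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(offsets(2500, 1000)); // [0, 1000, 2000]
    }
}
```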

As you might be able to tell, I keep reading bits and pieces about how to do this. If anyone can provide me a complete way to do this it would be greatly appreciated. I do know that you have to set FORWARD_ONLY on ScrollableResults and that calling evict(entity) will take an entity out of memory (unless you're doing second level caching, which I do not yet know how to check if I am or not). However I don't see any methods in the JavaDoc to read in say, 1000 records at a time. I want a balance between my lack of available memory and my slow network performance, so sending records over the network one at a time really isn't an option here. I am using Criteria API where possible. Thanks for any detailed replies.

KyleM
  • 4,445
  • 9
  • 46
  • 78
  • side note, if anyone can point out the appropriate terminology for 'scrolling through records X at a time' in the comments, that'd be appreciated. I thought it was called batch processing but the Hibernate docs only mention updating & deleting in their batch section... – KyleM Mar 15 '13 at 04:16
  • It is a form of batch processing / pagination, it's just that it probably isn't a very common request. I had a need to do this and ended up just fetching all the PKs in memory and doing a whole lot of `where Id in (...)` queries. – millimoose Mar 15 '13 at 20:00
  • Also, just because your table is indexed by PK doesn't necessarily mean the records are internally ordered by PK. The table might use an extensible-hashing index, though that doesn't sound likely, since B+-trees seem to be more common. Also, the query might filter using a different index than the PK one at some point, which means it'd still need to re-sort the results afterwards. (E.g. if you're selecting records in a date range, they would be retrieved from the table ordered by date.) An EXPLAIN should clear this up. – millimoose Mar 15 '13 at 20:04
  • @millimoose The Oracle documentation seems to say otherwise (about the PK always being indexed) – KyleM Mar 15 '13 at 20:25
  • I'm not saying the PK is not being indexed. I'm saying it might not be the index that's accessed first when a query is being executed. (Which would determine in which order the elements are in before ordering is applied.) – millimoose Mar 15 '13 at 21:17
  • @millimoose Gotcha. The other field in my where clause isn't indexed though, and I'm ordering by ID as I said, and ID is the PK. Why would it do anything other than use the already-sorted ID field? Not saying you're wrong, in fact you seem to be right judging by the slowness, I just don't get how to fix it. – KyleM Mar 15 '13 at 21:23

2 Answers


Maybe using Oracle's ROWNUM feature will help you.
Let's say we need to fetch the rows of table CUSTOMER 1000 at a time (pageSize) and we need the second page (pageNumber).

Creating and calling a query like this may be the answer:

select * from
  (select t.*, rownum row_number
   from (select * from Customer order by ID) t
   where rownum <= pageSize * pageNumber)
where row_number > pageSize * (pageNumber - 1)

Note that ROWNUM is assigned before ORDER BY is applied, so the ordering has to happen in the innermost subquery; otherwise the pages are cut from an unordered result.
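For pageSize = 1000 and pageNumber = 2 the target window is rows 1001..2000; the bounds can be sketched as (a hypothetical helper, pageNumber being 1-based):

```java
public class RownumWindow {
    // First row of the page (inclusive).
    static int lowerBound(int pageSize, int pageNumber) {
        return pageSize * (pageNumber - 1) + 1;
    }

    // Last row of the page (inclusive).
    static int upperBound(int pageSize, int pageNumber) {
        return pageSize * pageNumber;
    }

    public static void main(String[] args) {
        System.out.println(lowerBound(1000, 2) + ".." + upperBound(1000, 2)); // 1001..2000
    }
}
```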
Mohsen Heydari
  • 7,256
  • 4
  • 31
  • 46
  • 1
    If rows are added during this process, that could throw off rownum, correct? Something to be aware of. – Marvo Mar 15 '13 at 19:30
  • If it's for some of your entities you may use a native query; otherwise it must be implemented in your base data-access layer. – Mohsen Heydari Mar 15 '13 at 19:39
  • @Marvo: No, Oracle ensures the query result is not affected by concurrently added rows --- please see "Statement-Level Read Consistency" at http://www.stanford.edu/dept/itss/docs/oracle/10g/server.101/b10743/consist.htm – Mohsen Heydari Mar 15 '13 at 19:42
  • 1
    @M.Heydari The title says it all. "Statement-Level". Fetching subsequent chunks would occur in separate statements. (Of course it's possible transactions would handle this well.) – millimoose Mar 15 '13 at 19:57
  • So the correctness of each query will be guaranteed by the isolation level, and the correctness of the overall sequence of statements will be guaranteed by ordering by ID – Mohsen Heydari Mar 15 '13 at 20:02
  • @M.Heydari it would maintain statement level read consistency during that query. But he's got umpteen number of queries to perform, 1000 records at a time. If you have ten records, process the first three in this manner, then insert a record that has an ID that would make it the second record, you would then process the third record for a second time on the next query. Unless he wraps the entire thing in a transaction, which might be prohibitive. – Marvo Mar 15 '13 at 20:03
  • It's possible that he's not performing updates to this data, in which case the point is moot. – Marvo Mar 15 '13 at 20:03
  • If the ID is not a sequence, then there must be a timestamp field and the ORDER BY must be based on it. Is ID not a sequence? – Mohsen Heydari Mar 15 '13 at 20:08
  • Wouldn't matter if records can be deleted. – Marvo Mar 18 '13 at 19:50
  • Since the pivot element (ID or timestamp) is incremental, no: if records with IDs (1, 2, 3) exist, removing 2 won't change the ORDER BY result – Mohsen Heydari Mar 18 '13 at 19:55
  1. Load entities as read-only.

    For HQL

    Query.setReadOnly( true );
    

    For Criteria

    Criteria.setReadOnly( true );
    

    http://docs.jboss.org/hibernate/orm/3.6/reference/en-US/html/readonly.html#readonly-api-querycriteria

    A stateless session is quite different from a stateful session.

    Operations performed using a stateless session never cascade to associated instances. Collections are ignored by a stateless session

    http://docs.jboss.org/hibernate/orm/3.3/reference/en/html/batch.html#batch-statelesssession
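
    A minimal, untested sketch of a read-only scan with a StatelessSession, assuming a Hibernate 3.x SessionFactory named sessionFactory and your Customer entity (exception handling omitted):

    // Sketch only: stateless, forward-only scan. A stateless session has no
    // first-level cache, so entities do not accumulate in memory and there is
    // nothing to evict.
    StatelessSession session = sessionFactory.openStatelessSession();
    ScrollableResults results = session
            .createQuery("from Customer c order by c.id")
            .setFetchSize(1000)                    // hint to the JDBC driver
            .scroll(ScrollMode.FORWARD_ONLY);
    while (results.next()) {
        Customer c = (Customer) results.get(0);
        // ... accumulate statistics from c ...
    }
    results.close();
    session.close();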

  2. Use flush() and clear() to clean up the session cache.

    session.flush();

    session.clear();

    Question about Hibernate session.flush()

  3. ScrollableResults should work the way you expect.

    Do not forget that each item you load takes up memory unless you evict or clear it, and you need to verify that this really works.

    ScrollableResults in the MySQL Connector/J is fake scrolling: it loads the entire result set into memory, but I think the Oracle connector works fine.

    Using Hibernate's ScrollableResults to slowly read 90 million records
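
    As an untested sketch of that forward-only pattern with a regular Session (assumes Hibernate 3.x, your Customer entity, and a SessionFactory named sessionFactory):

    // Sketch only: read-only, forward-only scroll, clearing the first-level
    // cache every 1000 rows so only the current batch stays in memory.
    Session session = sessionFactory.openSession();
    ScrollableResults results = session
            .createCriteria(Customer.class)
            .addOrder(Order.asc("id"))
            .setReadOnly(true)
            .setFetchSize(1000)
            .scroll(ScrollMode.FORWARD_ONLY);
    int count = 0;
    while (results.next()) {
        Customer c = (Customer) results.get(0);
        // ... update statistics ...
        if (++count % 1000 == 0) {
            session.clear();   // drop the processed batch from the session cache
        }
    }
    results.close();
    session.close();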

    If you want an alternative, you may consider this approach:

    1. Select the primary key of every row that you will process
    2. Chop the keys into chunks
    3. Iterate:
       select rows by PK chunk (using an IN query)
       process them as you want
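The chunking in step 2 can be sketched as follows (a self-contained example with hypothetical names; each chunk would feed one `where ID in (...)` query in step 3 -- note that Oracle caps an IN list at 1000 literals, so a chunk size of 1000 is the practical maximum there):

```java
import java.util.ArrayList;
import java.util.List;

public class PkChunks {
    // Splits a list of primary keys into chunks of at most chunkSize,
    // each chunk feeding one "where ID in (...)" query.
    static <T> List<List<T>> chunks(List<T> ids, int chunkSize) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += chunkSize) {
            out.add(new ArrayList<>(ids.subList(i, Math.min(i + chunkSize, ids.size()))));
        }
        return out;
    }

    public static void main(String[] args) {
        List<Integer> ids = new ArrayList<>();
        for (int i = 1; i <= 2500; i++) ids.add(i);
        List<List<Integer>> batches = chunks(ids, 1000);
        System.out.println(batches.size());        // 3
        System.out.println(batches.get(2).size()); // 500
    }
}
```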
    
dgregory
  • 1,397
  • 1
  • 12
  • 26
  • `@Transactional` is Spring specific, which isn't mentioned anywhere in post/tags. – Nayan Wadekar Mar 15 '13 at 04:45
  • What does loading entities as read only solve? How can I grab 1000 records at a time? Is it setting the fetch size? The fetch size section of the hibernate docs refers to associations not anything else. Or is it setFirstResult and setMaxResults? If it is then why would I use a ScrollableResults with it? – KyleM Mar 15 '13 at 12:30