5

Very open question: I need to write a Java client that reads millions of records (say, account information) from an Oracle database, dumps them into XML, and sends them through web services to a vendor.

What is the most efficient way to do this, starting with fetching the millions of records? I went the JPA/Hibernate route and got OutOfMemory errors fetching 2 million records.

Is JDBC a better approach? Fetch each row and build the XML as I go? Any other alternatives?

I am not an expert in Java, so any guidance is appreciated.

user955165
  • Millions may not be as big a number as you think. – Caffeinated Apr 10 '14 at 20:27
  • What if you created a stored procedure inside of Oracle itself to generate the XML file? Then you could call the stored procedure from your Java app and send the XML file once it is written. A lot of times it's better to let the database itself do the heavy data lifting. – Alex McKenzie Apr 10 '14 at 20:30
  • Does it have to be one single XML document sent through webservices to the vendor? Is "chunking" (breaking it up into smaller files each containing, say, 100K lines) an option? – dcsohl Apr 10 '14 at 20:30
  • @Coffee - It depends on the average size of a record, doesn't it? If each record is, say, 1KB on average, that's a couple of GB. If OP is trying to build an in-memory DOM representation of the XML, small wonder there's a memory problem. – Ted Hopp Apr 10 '14 at 20:31
  • Take the SAX approach to building your XML and your method will scale to an arbitrarily large number of records, limited only by 1. disk space, and 2. how long you can lock the table/DB for (if concurrent updates are expected); i.e., imagine that you're writing your XML to an output stream. How many records you load at once into memory should be parametrized. – VH-NZZ Apr 10 '14 at 20:33
  • Are you planning to send millions of records in a single XML file (a single web service request)? That doesn't sound like a good plan. Does the consumer know that it's going to be a very big XML? – Mani Apr 10 '14 at 20:36
  • Fetch the data in batches from the DB, a thousand at a time. Convert to XML and send it across. – bgth Apr 10 '14 at 20:36
  • As far as I know the vendor is looking for one big XML; I am working with them to provide some sort of streaming solution. At this point breaking it into chunks is not an option, but it could become possible depending on the vendor. As for the approach of building the XML in the database, is this using XMLELEMENT? That means I have to manually comply with their XSD schema. I was able to compile the vendor's schema using JAXB. – user955165 Apr 10 '14 at 20:46

7 Answers

6

We faced a similar problem some time back, and our record count was in excess of 2M. This is how we approached it.

  • Using any O/R mapping tool was simply ruled out due to the large overhead, like the creation of huge numbers of POJOs, which is basically not required if the data is only to be dumped to XML.

  • Plain JDBC is the way to go. The main advantage is that it returns a ResultSet object which does not actually hold all the results at once, so loading the entire data set into memory is not an issue; the data is loaded as we iterate over the ResultSet.

  • Next comes the creation of the XML file. We create an XML file and open it in append mode.

  • Now, in the loop where we iterate over the ResultSet, we create XML fragments and append them to the XML file. This goes on till the entire ResultSet has been iterated (a minimal sketch follows this list).

  • In the end, what we have is an XML file with all the records.

  • Now, for sharing this file, we created a web service which returns the URL to this XML file (archived/zipped), if the file is available.

  • The client could download this file anytime after this.

  • Note that this is not a synchronous system, meaning the file does not become available right after the client makes the call. Since creating the XML takes a lot of time, HTTP would normally time out, hence this approach.
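A minimal sketch of that loop, assuming a hypothetical accounts table, illustrative column names, and an accounts.xml output file (none of these come from the original answer); real code would also XML-escape the column values:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

public class AccountExportSketch {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                     "jdbc:oracle:thin:@//dbhost:1521/ORCL", "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "SELECT account_id, account_name FROM accounts");
             // the output file is opened in append mode, as described above
             BufferedWriter out = new BufferedWriter(new FileWriter("accounts.xml", true))) {

            ps.setFetchSize(1000);   // pull rows from Oracle in batches, not all at once
            out.write("<accounts>");
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    // one small XML fragment per row, appended immediately,
                    // so only the current row is ever held in memory
                    out.write("<account id=\"" + rs.getLong("account_id") + "\">"
                            + "<name>" + rs.getString("account_name") + "</name>"
                            + "</account>");
                }
            }
            out.write("</accounts>");
        }
    }
}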

Just an approach you can take a cue from. Hope this helps.

Santosh
  • I have successfully used Hibernate for large result sizes. There are no large POJOs involved. – Marko Topolnik Apr 10 '14 at 20:47
  • Yes. My concern was more about the necessity of that when the data is only to be dumped as XML and not subjected to any business operation, which is what a POJO facilitates. – Santosh Apr 10 '14 at 20:51
  • @Santosh **On your ruling out ORM**: I disagree, especially if you're using JAXB on the same domain model to create your XML. **On your justification of JDBC over JPA**: You can (and should) paginate results in JPQL/CriteriaQuery, thus preventing loading more than batches of reasonable size into memory. – VH-NZZ Apr 10 '14 at 21:25
  • @okiharaherbst, My observation is for a specific use case wherein the data is to be dumped to a file. JPA does not offer any significant advantage over JDBC (_in this particular case_). Think of it from an efficiency/resource consumption/simplicity point of view. – Santosh Apr 11 '14 at 08:00
  • @Santosh One could arguably debate at length whether straight low level jdbc offers any edge over jpa or vice versa. I _claim_ that the efficiency/resource tradeoff should be taken with a grain of salt more often than not as vertically scaling nodes is _s i g n i f i c a n t l y_ cheaper than highly skilled labor nowadays. In that respect, I'd rather take the high level approach of _not_ fiddling with SQL when I can afford to do so and leave my brain in a more systemic mindset – which, I believe, is also encouraged by Java as a whole – VH-NZZ Apr 11 '14 at 08:13
  • @okiharaherbst it's always debatable; not all conditions are alike, and hence there cannot be a silver bullet for every problem. – Santosh Apr 11 '14 at 11:03
3

Use ResultSet#setFetchSize() to optimize the number of records fetched at a time from the database.

See What does Statement.setFetchSize(nSize) method really do in SQL Server JDBC driver?

In JDBC, the ResultSet#setFetchSize(int) method is very important to performance and memory-management within the JVM as it controls the number of network calls from the JVM to the database and correspondingly the amount of RAM used for ResultSet processing.

Read here about Oracle ResultSet Fetch Size
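As a rough illustration of where the call goes (the accounts query is made up for the example):

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

class FetchSizeSketch {
    // assumes 'conn' is an already-open connection to Oracle
    static void readAccounts(Connection conn) throws SQLException {
        try (Statement stmt = conn.createStatement()) {
            // transfer 500 rows per network round trip instead of the Oracle
            // driver's default of 10; tune the value to your row size
            stmt.setFetchSize(500);
            try (ResultSet rs = stmt.executeQuery("SELECT account_id FROM accounts")) {
                while (rs.next()) {
                    long id = rs.getLong(1);   // each row streams in as it is needed
                }
            }
        }
    }
}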

Braj
1

For this size of data, you can probably get away with starting Java with more memory. Check out using -Xmx and -Xms when you start Java.
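For example, a launch line along these lines (the jar name and heap sizes are illustrative, not from the answer):

java -Xms2g -Xmx6g -jar account-export.jar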

If your data is truly too big to fit in memory, but not big enough to warrant investment in different technology, think about operating in chunks. Does this have to be done at once? Can you slice up the data into 10 chunks and do each chunk independently? If it has to be done in one shot, can you stream data from the database, and then stream it into the file, forgetting about things you are done with (to keep memory use in the JVM low)?

Cory Kendall
  • To chunk the data see http://stackoverflow.com/questions/12607400/something-like-cursor-or-recordset-iterator-in-hibernate-or-jpa – pscuderi Apr 10 '14 at 20:39
1
  1. Read the records in chunks, as explained by previous answers.

  2. Use StAX http://stax.codehaus.org/ to stream the record chunks into your XML file, as opposed to building all records into one large in-memory document (a sketch follows below).
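A minimal StAX sketch along those lines, writing a few made-up rows straight to a file (in a real run the loop would be driven by each chunk fetched from the database):

import java.io.FileOutputStream;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

class StaxExportSketch {
    public static void main(String[] args) throws Exception {
        XMLOutputFactory factory = XMLOutputFactory.newInstance();
        try (FileOutputStream fos = new FileOutputStream("accounts.xml")) {
            XMLStreamWriter writer = factory.createXMLStreamWriter(fos, "UTF-8");
            writer.writeStartDocument("UTF-8", "1.0");
            writer.writeStartElement("accounts");

            // stand-in for iterating over each chunk of records read from the DB
            for (int i = 1; i <= 3; i++) {
                writer.writeStartElement("account");
                writer.writeAttribute("id", String.valueOf(i));
                writer.writeStartElement("name");
                writer.writeCharacters("Account " + i);   // characters are escaped for us
                writer.writeEndElement();
                writer.writeEndElement();
                writer.flush();   // nothing accumulates in memory between chunks
            }

            writer.writeEndElement();
            writer.writeEndDocument();
            writer.close();
        }
    }
}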

Ove Sundberg
0

As far as the Hibernate side is concerned, fetch using a SELECT query (instead of a FROM query) to prevent filling up the caches; alternatively, use a StatelessSession. Also be sure to use scroll() instead of list(). Configuring hibernate.jdbc.fetch_size to something like 200 is also recommended.
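A rough sketch of that combination, assuming a configured SessionFactory and a mapped Account entity (both are placeholders, not taken from the question):

import org.hibernate.ScrollMode;
import org.hibernate.ScrollableResults;
import org.hibernate.SessionFactory;
import org.hibernate.StatelessSession;

class ScrollingExportSketch {
    static void export(SessionFactory sessionFactory) {
        // a StatelessSession keeps no first-level cache, so nothing accumulates
        StatelessSession session = sessionFactory.openStatelessSession();
        try {
            ScrollableResults results = session
                    .createQuery("select a from Account a")   // SELECT query, not a bare FROM
                    .setFetchSize(200)                        // mirrors hibernate.jdbc.fetch_size
                    .scroll(ScrollMode.FORWARD_ONLY);
            while (results.next()) {
                Object account = results.get(0);   // placeholder entity; write it out, then let it go
            }
            results.close();
        } finally {
            session.close();
        }
    }
}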

On the response side, XML is quite a bad choice because parsing is difficult. If this is already settled, then make sure you use a streaming XML serializer. For example, the XPP3 library contains one.

Marko Topolnik
0

While a reasonable Java approach would probably involve a StAX construction of your XML in conjunction with paginated result sets (straightforward JDBC or JPA), keep in mind that you may need to lock your database for updates all the while, which may or may not be acceptable in your case.

We took a different, database-centric approach using stored procedures and triggers on INSERT and UPDATE to generate the XML node corresponding to each row/[block of] data. This ensures that 250 GB+ of raw data and its XML representation (~10 GB) are constantly up to date, and reduces (no pun intended) the export to a mere concatenation exercise.

VH-NZZ
0

You can still use Hibernate to fetch millions of records; you just cannot do it in one round, because millions is a big number and of course you will get an out-of-memory exception. You can divide the work into pages and dump each page to XML, so that the records are not all kept in RAM and your program does not need a huge amount of memory.

I have these two methods from a previous project that I used very frequently. Unfortunately I did not like using HQL very much, so I don't have the code for that.

Here INT_PAGE_SIZE is the number of rows you would like to fetch in each round, and getPageCount returns the total number of rounds needed to fetch all of the records. Then paging restricts the Criteria to the records of a given page, from 1 to getPageCount.

import org.hibernate.Criteria;
import org.hibernate.criterion.ProjectionList;
import org.hibernate.criterion.Projections;

// Total number of pages needed to cover every row, given INT_PAGE_SIZE rows per page.
public int getPageCount(Criteria criteria) {
    ProjectionList pl = Projections.projectionList();
    pl.add(Projections.rowCount());
    criteria.setProjection(pl);
    // rowCount() returns a Long in recent Hibernate versions, so go through Number
    // instead of casting straight to Integer
    int rowCount = ((Number) criteria.list().get(0)).intValue();
    criteria.setProjection(null);   // clear the projection so the Criteria can be reused
    if (rowCount % INT_PAGE_SIZE == 0) {
        return rowCount / INT_PAGE_SIZE;
    }
    return rowCount / INT_PAGE_SIZE + 1;
}

// Restrict the Criteria to one page of results; page numbers start at 1,
// and -1 means "no paging".
public Criteria paging(Criteria criteria, int page) {
    if (page != -1) {
        criteria.setFirstResult((page - 1) * INT_PAGE_SIZE);
        criteria.setMaxResults(INT_PAGE_SIZE);
    }
    return criteria;
}
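For completeness, a hedged sketch of how those two methods might be driven; the Session and the Account entity are placeholders that do not appear in the answer:

// 'session' is an open org.hibernate.Session; Account is a mapped entity (both illustrative)
Criteria countCriteria = session.createCriteria(Account.class);
int pages = getPageCount(countCriteria);
for (int page = 1; page <= pages; page++) {
    Criteria pageCriteria = paging(session.createCriteria(Account.class), page);
    for (Object record : pageCriteria.list()) {
        // convert this record to an XML fragment and append it to the output file
    }
}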