
I have a requirement to iterate over millions of records returned in a ResultSet, process them, and store them in a data structure. I have not been able to find any relevant example or reference for this. jOOQ seems to do something along the lines of what I want, but it is not free. I was hoping that Java 8 streams might let me achieve it, but no example or write-up seems to give me any direction to head towards. I am open to other alternatives as well.
Based on this SO reference: resultset parallel, I attempted the code below, but it did not give me any performance improvement, as can be seen in the performance metrics below.
Sequential iteration:

while (rs.next()) {
    System.out.println(rs.getString(1));
}

Using streams and spliterator:

Stream<String> s = StreamSupport.stream(new Spliterators.AbstractSpliterator<String>(
        Long.MAX_VALUE, Spliterator.ORDERED) {

    @Override
    public boolean tryAdvance(Consumer<? super String> action) {
        try {
            if (!rs.next())
                return false;
            action.accept(rs.getString(1));
            return true;
        } catch (SQLException e) {
            // Returning true after a failure would risk an endless loop,
            // so rethrow instead of just printing the stack trace.
            throw new RuntimeException(e);
        }
    }
}, true); // true requests a parallel stream
s.forEach(System.out::println);

Total number of records: 3759
Time taken by Sequential: ~ 83.8 secs
Time taken by Streams: ~ 83.5 secs

Can anyone review this and tell me whether I have implemented the stream correctly?

Mrunal Gosar
  • Maybe this question and answer will help: http://stackoverflow.com/a/32232173/4105457 – Flown Oct 23 '16 at 08:24
  • I'm having trouble connecting your title and your question. – user207421 Oct 23 '16 at 09:01
  • There are so many result set processing examples out there, so I am supposing your problem is somewhere else? I think you should be able to make your question much more specific. For now, I think it’s too unclear to answer. – Ole V.V. Oct 23 '16 at 09:25
  • @Flown Earlier I did try the solution given in the SO reference which you've suggested above. But that did not yield any performance improvement. Updated OP with the code that I tried and the results. – Mrunal Gosar Oct 24 '16 at 09:14
  • Note that jOOQ is free if you use an open source database. Parallel processing has some overhead with streams, and a JDBC driver will perform a lot of synchronization to make a connection thread safe, which will likely make your code serial anyway; you might want to consider retrieving the values from the database on a single thread, and only then parallelizing. – Mark Rotteveel Oct 24 '16 at 09:21
  • Can the downvoter also care to explain reason for downvote – Mrunal Gosar Oct 24 '16 at 09:23
  • @MarkRotteveel I am using Oracle Db and JOOQ is paid for Oracle DB, also to give you some more insight I am loading certain records in in-memory data structure from DB. Apparently using sequential approach to load millions of data in memory is really time consuming and non productive, hence I was thinking of utilising the servers multi core architecture to parallelize this loading activity and save time. Would you be able to suggest any alternate approach If any – Mrunal Gosar Oct 24 '16 at 09:34
  • Look into a producer/consumer setup: have a single producer get items from the result set and put it in - for example - a `BlockingQueue` and have multiple consumers get items from this queue to process them in parallel. – Mark Rotteveel Oct 24 '16 at 09:42
  • @MarkRotteveel In that case if producer is producing items one at a time (in this case reading records from DB), then no matter how many consumer that I spawn it will still be slow isn't it? – Mrunal Gosar Oct 24 '16 at 09:49
  • That depends on how much time the processing takes, but yes you might gain very little. On that other hand one thread producing and one thread consuming might already be faster than one thread doing all the work. You won't know for sure until you tried it. But I can almost guarantee it will be faster than trying some kind of parallel stream. – Mark Rotteveel Oct 24 '16 at 10:09
  • Mrunal's question was a good one and the same one as mine. Why the downvotes?? I am upvoting. – Geyser14 May 19 '17 at 20:02
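A minimal, runnable sketch of the producer/consumer setup Mark Rotteveel describes above. Here a plain `List` stands in for the JDBC `ResultSet` (in real code the producer loop would be `while (rs.next()) queue.put(rs.getString(1));`), and all class and variable names are illustrative:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class ProducerConsumerDemo {
    private static final String POISON = "\u0000EOF"; // sentinel marking end of input

    // Drains 'rows' through a bounded queue into 'consumers' worker threads and
    // returns everything the workers processed. The single producer mirrors the
    // one thread that is allowed to touch the ResultSet.
    static Collection<String> processAll(List<String> rows, int consumers)
            throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(1024);
        ConcurrentLinkedQueue<String> processed = new ConcurrentLinkedQueue<>();
        ExecutorService pool = Executors.newFixedThreadPool(consumers);
        for (int i = 0; i < consumers; i++) {
            pool.submit(() -> {
                try {
                    for (String row; !(row = queue.take()).equals(POISON); ) {
                        processed.add(row.toUpperCase()); // expensive per-row work goes here
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        for (String row : rows) {
            queue.put(row);                // single producer feeding the queue
        }
        for (int i = 0; i < consumers; i++) {
            queue.put(POISON);             // one poison pill per consumer
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.MINUTES);
        return processed;
    }

    public static void main(String[] args) throws InterruptedException {
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 10_000; i++) rows.add("row-" + i);
        System.out.println(processAll(rows, 4).size());
    }
}
```

This only pays off when the per-row processing is substantially more expensive than the fetch; a bounded queue keeps the producer from racing far ahead of the consumers.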

1 Answer


A ResultSet cannot be processed in parallel. It is a kind of iterator, incorporating mutable state that must be altered to advance through the query results; most notably, the ResultSet has a current row to which it must be moved before values can be read. Even for the values within a row, which are accessed by index, the specification makes no thread-safety guarantees and mentions the possibility that reading them out of order might not be supported by the underlying database.

So the only operations that could benefit from parallel processing are the chained subsequent operations, but when the only chained operation is a System.out::println, parallelism makes it worse. Not only does the print operation not benefit from parallel processing, in all standard implementations System.out holds a PrintStream that synchronizes every write operation to the target output.

Note that even if you chain computational intense operations, which might benefit from parallel processing, it is still possible that the expensive database operation dominates the overall execution time. That’s why it is important to let the database filter and/or aggregate data as much as possible, before transferring them to the Java side…
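As a sketch of that point: fetch the rows sequentially (as a plain JDBC loop must), and parallelize only the CPU-bound work afterwards. The fetch below is simulated with a generated list standing in for the `while (rs.next())` loop, and all names are illustrative:

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class FetchThenParallel {
    // Stands in for the sequential JDBC loop:
    //   while (rs.next()) rows.add(rs.getString(1));
    static List<String> fetchRows() {
        return IntStream.range(0, 1000)
                        .mapToObj(i -> "row-" + i)
                        .collect(Collectors.toList());
    }

    // Only this stage runs in parallel; it pays off only when the
    // per-row work is genuinely CPU-heavy.
    static long process(List<String> rows) {
        return rows.parallelStream()
                   .map(String::toUpperCase)
                   .filter(r -> r.endsWith("0"))
                   .count();
    }

    public static void main(String[] args) {
        System.out.println(process(fetchRows()));
    }
}
```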

Holger
  • :( well that leads me to suffer with whatever solution I have now. I would still keep this question open and see if anyone is able to suggest any hack / workaround. But thanks for your response and explanation. – Mrunal Gosar Oct 24 '16 at 13:06
  • You can, however, open multiple JDBC connections and split your query up in multiple independent queries which you can then execute in parallel. Some experimentation will most likely allow you to reach the point where either the database or the network connection is the bottleneck. – Thorbjørn Ravn Andersen Dec 25 '18 at 22:31
  • @ThorbjørnRavnAndersen that goes into the “let the database do as much as possible” direction, as it implies that the database is responsible for splitting the data into chunks (if not, the programmer is capable of dividing that single query statement into multiple *different* statements even before that). Unless you have one of those really big distributed database systems, creating multiple statements (or even multiple connections) out of the one creates more overhead than the entire bulk data transfer operation. – Holger Jan 07 '19 at 08:29
  • @Holger Perhaps. Depends on the actual usage, but as OP wanted to have better performance the way to go here is parallel and split the work. – Thorbjørn Ravn Andersen Jan 07 '19 at 11:48
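A sketch of the range splitting Thorbjørn Ravn Andersen suggests: compute non-overlapping id ranges up front, then run one parameterized query per range on its own connection. The table and column names are placeholders, and whether this wins at all depends on the database, as Holger notes:

```java
import java.util.ArrayList;
import java.util.List;

public class RangePartitioner {
    // Splits [minId, maxId] into roughly equal, non-overlapping [low, high]
    // ranges, one per worker. Each worker would then run something like
    //   SELECT col FROM tab WHERE id BETWEEN ? AND ?
    // on its own Connection (names here are placeholders).
    static List<long[]> partitions(long minId, long maxId, int workers) {
        List<long[]> ranges = new ArrayList<>();
        long span = maxId - minId + 1;
        long chunk = (span + workers - 1) / workers; // ceiling division
        for (long low = minId; low <= maxId; low += chunk) {
            ranges.add(new long[] { low, Math.min(low + chunk - 1, maxId) });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Example: split the question's 3759 ids across 4 workers.
        for (long[] r : partitions(1, 3_759, 4)) {
            System.out.println(r[0] + " .. " + r[1]);
        }
    }
}
```

This assumes a reasonably dense numeric key; for skewed or non-numeric keys the ranges would need to come from the database itself (e.g. via NTILE or a keyset scan).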