
There's a similar question about streaming large result sets, but the answer just points at the docs and no clear answer emerges. I believe that merely treating a full result set as a stream still takes a lot of memory on the JDBC driver side.

I am wondering if there's any clear-cut pattern, or best practice, for making it work, especially on the JDBC driver side.

And in particular I am not sure that setFetchSize(Integer.MIN_VALUE) is a very good idea, as it seems far from optimal if it means each row is sent on the wire on its own.
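For reference, the incantation the Connector/J docs describe for row streaming looks like this; a minimal sketch, with a made-up table, URL, and credentials:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class MySqlStreaming {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection details; adjust for your environment.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:mysql://localhost:3306/test", "user", "password");
             Statement stmt = conn.createStatement(
                 ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)) {

            // The Connector/J streaming switch: a forward-only, read-only
            // statement plus this sentinel fetch size tells the driver to
            // stream rows instead of materializing the whole result set.
            stmt.setFetchSize(Integer.MIN_VALUE);

            try (ResultSet rs = stmt.executeQuery(
                     "SELECT id, payload FROM big_table")) { // made-up table
                while (rs.next()) {
                    System.out.println(rs.getLong("id"));
                }
            }
        }
    }
}
```

The alternative that comes up in the comments below is adding `useCursorFetch=true` to the JDBC URL, which, per the Connector/J docs, makes the driver honor a positive `setFetchSize(n)` via a server-side cursor instead.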

I believe libraries like jOOQ and Slick already take care of that... and I am curious how to accomplish it both with and without them.

Thanks!

matanster
  • P.S. There's something somewhat involved and specific to Postgres as far as using Slick is concerned: http://stackoverflow.com/questions/31340507/what-is-the-right-way-to-work-with-slicks-3-0-0-streaming-results-and-postgresq – matanster Mar 08 '16 at 21:38
  • The documentation (http://dev.mysql.com/doc/connector-j/en/connector-j-reference-implementation-notes.html) says that either everything is loaded in memory, or rows are retrieved one by one. Other database drivers have saner approaches. – JB Nizet Mar 08 '16 at 21:39
  • Mmmm, I'm not sure it is really explicit in that documentation that these are the only options... are you sure? – matanster Mar 08 '16 at 21:41
  • No, I'm not. But if there is another option, that's where it should be discussed. There's also the `useCursorFetch` property documented here (http://dev.mysql.com/doc/connector-j/en/connector-j-reference-configuration-properties.html), but it's not clear to me what it exactly means, and if the actual value of fetchSize is used for anything. – JB Nizet Mar 08 '16 at 21:50
  • Same here, thanks, that's why I wonder if there's a best practice for this that's already been proven. – matanster Mar 08 '16 at 21:54
  • The recent related question [here](http://stackoverflow.com/q/35346229/2144390) may be of interest, specifically with regard to the JDBC side of working with MySQL. – Gord Thompson Mar 08 '16 at 22:02
  • @JBNizet re: "either everything is loaded in memory, or rows are retrieved one by one" - The testing I did for [this answer](http://stackoverflow.com/a/35412659/2144390) suggests that the documentation has oversimplified the description of how a ResultSet is streamed. – Gord Thompson Mar 08 '16 at 23:19
  • @Gord good to know, thank you – JB Nizet Mar 09 '16 at 06:40
  • I'm considering to vote to close as duplicate of http://stackoverflow.com/questions/35346229/does-mysql-connector-j-buffer-rows-when-streaming-a-resultset but I am a bit on the fence as it is not entirely the same question. OP, do you think Gord's answer on that question answers yours? – Mark Rotteveel Mar 09 '16 at 08:26
  • @GordThompson many thanks for pointing out your research on it. If I were to follow your answer there, the answer here would be "let the MySQL driver automatically take care of the buffering for you", even if the somewhat lacking official documentation makes it all a little uncomfortable. – matanster Mar 09 '16 at 09:28
  • @MarkRotteveel I am not sure whether there might be an alternative answer as well, so I humbly think this can stay open. – matanster Mar 09 '16 at 09:29
  • For the record (since you tagged this question with [tag:jooq]), jOOQ doesn't and shouldn't help you here. Streaming is the responsibility of the server network protocol. JDBC is the protocol abstraction API. jOOQ is a SQL language abstraction API, and thus shouldn't interfere with JDBC. While jOOQ does support `java.util.stream.Stream` for streaming results, this high-level abstraction makes no assumptions on how records are transferred over the wire. – Lukas Eder Mar 09 '16 at 20:21
  • @LukasEder thanks a lot for the time-saving tip. – matanster Mar 09 '16 at 23:03

1 Answer


> I am wondering if there's any clear-cut pattern, or best practice, for making it work, especially on the JDBC driver side.

The best practice is not to do synchronous streaming but rather to fetch in moderate-size chunks. However, avoid using OFFSET. If you're doing a batch process, this can be facilitated by first selecting and pushing the data into a temporary table (i.e., turn the original results you want into a table first, and then select chunks from that table... databases are really fast at copying data internally).
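As an illustration of OFFSET-free chunking, here is a keyset-pagination sketch; the table, key column, and chunk size are assumptions:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ChunkedFetch {
    // Fetch moderate-size chunks ordered by an indexed key, resuming from the
    // last key seen. Unlike OFFSET, the database never has to scan and throw
    // away the rows that were already processed.
    public static void processInChunks(Connection conn, int chunkSize)
            throws SQLException {
        long lastId = 0; // assumes ids are positive and indexed
        int rows;
        do {
            rows = 0;
            try (PreparedStatement ps = conn.prepareStatement(
                     "SELECT id, payload FROM big_table"
                     + " WHERE id > ? ORDER BY id LIMIT ?")) {
                ps.setLong(1, lastId);
                ps.setInt(2, chunkSize);
                try (ResultSet rs = ps.executeQuery()) {
                    while (rs.next()) {
                        lastId = rs.getLong("id");
                        process(rs.getString("payload"));
                        rows++;
                    }
                }
            }
        } while (rows == chunkSize); // a short chunk means we're done
    }

    private static void process(String payload) {
        // placeholder for real per-row work
    }
}
```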

Synchronous streaming in general does not scale (i.e., the iterator model). It does not scale well for batch processing, and it certainly does not scale for handling many clients. This is why the drivers vary and do so many different things: it is fairly difficult to figure out how many resources to load, because it is a pull model. Async streaming (a push model) would probably help, but unfortunately the JDBC standard does not support async streaming.

You might notice that this is one of the reasons why many of the wrappers around JDBC, such as Spring JDBC, do not return Iterators (along with the fact that the resource also needs to be manually cleaned up). Some of the wrappers do provide iterators, but really they just turn the results into a list, as the sketch below illustrates by contrast.
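For instance, Spring's JdbcTemplate pushes each row through a callback and cleans up the ResultSet itself, which is exactly why no Iterator ever escapes; a sketch against a hypothetical table:

```java
import javax.sql.DataSource;
import org.springframework.jdbc.core.JdbcTemplate;

public class CallbackStyle {
    // The template owns the Connection/Statement/ResultSet lifecycle; the
    // caller only supplies per-row logic, so nothing is left open to leak.
    public static void printPayloads(DataSource dataSource) {
        JdbcTemplate template = new JdbcTemplate(dataSource);
        template.query(
            "SELECT payload FROM big_table",                    // made-up table
            rs -> System.out.println(rs.getString("payload"))); // RowCallbackHandler
    }
}
```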

It is rather disturbing that the Scala version you link to is upvoted, given the stateful nature of managing a ResultSet... it's very un-Scala-like. I'm not sure those folks know they have to consume the iterator or close the connection/ResultSet properly, which requires a fair amount of imperative programming.

While it may seem inefficient to let the database decide how much to buffer, just remember that most database connections are extremely heavy memory-wise (at least on Postgres they are). So if you take a long time streaming and have many clients, you're going to have to create more connections and put a serious burden on the database. Not to mention that the default buffers have probably been highly optimized (i.e., the ResultSet size the client ends up with).

Finally, for batch processing, chunks can be done in parallel, which is obviously more efficient than a synchronous pipeline, and which also allows restarting (without having to rework already-processed data) if a problem occurs.
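A sketch of that parallel variant, splitting the key space into independent ranges so a failed range can be retried on its own (the pool size, range math, and table are assumptions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import javax.sql.DataSource;

public class ParallelChunks {
    public static void processAll(DataSource ds, long maxId, long rangeSize)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4); // arbitrary size
        List<Future<?>> pending = new ArrayList<>();
        // Each range is an independent unit of work with its own connection.
        for (long lo = 0; lo < maxId; lo += rangeSize) {
            final long from = lo;
            final long to = lo + rangeSize;
            pending.add(pool.submit(() -> processRange(ds, from, to)));
        }
        for (Future<?> f : pending) {
            f.get(); // a failure here means retrying only that range
        }
        pool.shutdown();
    }

    private static void processRange(DataSource ds, long from, long to) {
        // e.g. SELECT ... WHERE id > from AND id <= to on a connection
        // borrowed from ds, then the same per-row work as above
    }
}
```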

Adam Gent