
While coding the solution to the problem of downloading a huge dynamic ZIP with low RAM impact, an idea started besieging me, and it led to this question, asked out of pure curiosity / hunger for knowledge:

What kind of drawbacks could I meet if, instead of loading the InputStreams one at a time (with separate queries to the database), I loaded them all in a single query, returning a List of n (potentially thousands of) "opened" InputStreams?

Current (safe) version: n queries, one InputStream instantiated at a time

for (long id : ids) {
    InputStream in = getMyService().loadStreamById(id);
    try {
        IOUtils.copyStream(in, out);
    } finally {
        in.close(); // always release the stream, even if the copy fails
    }
}

Hypothetical version: one query, n instantiated InputStreams

List<InputStream> streams = getMyService().loadAllStreams();

for (InputStream in : streams) {
    try {
        IOUtils.copyStream(in, out);
    } finally {
        in.close(); // closing is what matters; nulling the reference adds nothing
    }
}

What are the pros and cons of the second approach, excluding the (I suppose small) amount of memory used to keep multiple Java InputStreams instantiated?

Could it lead to some kind of network freeze or database stress (or locks, or problems if others read/write the same BLOB field a stream points to, etc.) more than multiple queries would?

Or are they smart enough to be almost invisible until asked for data, so that 1 query + 1000 active streams could be better than 1000 queries + 1 active stream?
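
For context, a minimal JDBC-level sketch of what such a loadAllStreams() could look like (table and column names are hypothetical; note that the Statement/ResultSet would have to stay open while the streams are consumed, and many drivers invalidate a stream as soon as the cursor moves on):

import java.io.InputStream;
import java.sql.*;
import java.util.*;

List<InputStream> loadAllStreams(Connection conn) throws SQLException {
    List<InputStream> streams = new ArrayList<>();
    PreparedStatement ps = conn.prepareStatement(
            "SELECT blob_field FROM table WHERE length(blob_field) > 0");
    ResultSet rs = ps.executeQuery();
    while (rs.next()) {
        // Whether this merely hands out a lazy pointer or eagerly reads the
        // whole BLOB is exactly the open question here, and driver-specific.
        streams.add(rs.getBinaryStream("blob_field"));
    }
    return streams;
}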

Andrea Ligios

2 Answers


The short answer is that you risk hitting a limit of your operating system and/or DBMS.

The longer answer depends on the specific operating system and DBMS, but here are a few things to think about:

  • On Linux there is a maximum number of open file descriptors that any process may hold. The default is/was 1024, but it's relatively easy to increase. The intent of this limit, IMO, is to kill a poorly-written process, as the amount of memory required per file/socket is minimal (on a modern machine).
  • If the open stream represents an individual socket connection to the database, there's a hard limit on the number of client sockets that a single machine may open to a single server address/port. This is driven by the client's dynamic port range, and it's roughly 16k or 32k (but can be modified). This limit applies across all processes on the machine, so excessive consumption by one process may starve other processes trying to access the same server.
  • Depending on how the DBMS manages the connections used to retrieve BLOBs, you may run into a much smaller limit enforced by the DBMS. Oracle, for example, defaults to a total of 50 "cursors" (active retrieval operations) per user connection.

Aside from these limits, you won't get any benefit from your code as written, since it runs through the streams sequentially. If you were to use multiple threads to read, you might see some benefit from multiple concurrent connections. However, I'd still open those connections on an as-needed basis. And lest you think of spawning a thread per connection (and running into the physical limit on the number of threads), you'll probably reach a practical throughput limit before hitting any physical limit.
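
For illustration, a rough sketch of that multi-threaded, as-needed variant (the pool size is arbitrary, getMyService() and loadStreamById() are the asker's hypothetical service methods, and buffering each blob in memory trades RAM for parallelism, which may defeat the original low-RAM goal):

import java.io.*;
import java.util.*;
import java.util.concurrent.*;

void copyAllConcurrently(List<Long> ids, OutputStream out) throws Exception {
    ExecutorService pool = Executors.newFixedThreadPool(4); // arbitrary size
    try {
        List<Future<byte[]>> results = new ArrayList<>();
        for (long id : ids) {
            // Each task opens its own stream only when it actually runs
            results.add(pool.submit(() -> {
                try (InputStream in = getMyService().loadStreamById(id);
                     ByteArrayOutputStream buf = new ByteArrayOutputStream()) {
                    byte[] chunk = new byte[8192];
                    int n;
                    while ((n = in.read(chunk)) != -1) {
                        buf.write(chunk, 0, n);
                    }
                    return buf.toByteArray();
                }
            }));
        }
        for (Future<byte[]> f : results) {
            out.write(f.get()); // write results in the original order
        }
    } finally {
        pool.shutdown();
    }
}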

parsifal
  • +1, excellent answer, thanks. The benefit of the "hypothetical version" above would just consist in *one single query* instead of *a lot of queries*. Since a query costs some milliseconds, avoiding querying the db thousands of times could save some seconds. Regarding the `50 "cursors" (active retrieval operations)`, that's the core of the discussion: when I call `rs.getBinaryStream("field")`, am I actually **activating** a cursor, or am I just **instantiating** it for a ***lazy activation*** later, when requesting data from it? – Andrea Ligios May 24 '13 at 23:54

I tested it in PostgreSQL, and it works.

Since PostgreSQL does not seem to have a predefined max cursor limit, I still don't know whether the simple assignment of a cursor/pointer from a BLOB field to a Java InputStream object via java.sql.ResultSet.getBinaryStream("blob_field") is considered an active retrieval operation or not (I guess not, but who knows...);

Loading all the InputStreams at once, with something like SELECT blob_field FROM table WHERE length(blob_field)>0, produced a very long query execution time, followed by very fast access to the binary content (read sequentially, as above).
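
A simplified sketch of how the two phases can be timed separately (reusing the question's hypothetical loadAllStreams() and IOUtils.copyStream()):

long t0 = System.currentTimeMillis();
List<InputStream> streams = getMyService().loadAllStreams(); // the big query
long t1 = System.currentTimeMillis();

for (InputStream in : streams) {
    try {
        IOUtils.copyStream(in, out);
    } finally {
        in.close();
    }
}
long t2 = System.currentTimeMillis();

System.out.println("query: " + (t1 - t0) + " ms, reads: " + (t2 - t1) + " ms");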


With a test case of 200 MB, consisting of 20 files of 10 MB each:

  • The old way took circa 1 second per query, plus 0.XX seconds for the other operations (reading each InputStream and writing it to the OutputStream, and something else); total elapsed time: 35 seconds

  • The experimental way took circa 22 seconds for the big query, plus 12 seconds total for iterating and performing the other operations; total elapsed time: 34 seconds

This makes me think that while assigning the BinaryStream from the database to the Java InputStream object, the complete read is already performed :/ which makes using an InputStream similar to using a byte[] (but worse in this case, because of the memory overhead of having all the items instantiated at once);
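
(As a side note: if truly lazy streaming is needed on PostgreSQL, the JDBC driver's Large Object API, a separate mechanism from bytea columns, is designed to fetch data from the server on demand. A minimal sketch, assuming the table stores a large-object OID instead of the bytea itself:)

import java.io.InputStream;
import java.io.OutputStream;
import java.sql.Connection;
import org.postgresql.PGConnection;
import org.postgresql.largeobject.LargeObject;
import org.postgresql.largeobject.LargeObjectManager;

void copyLargeObject(Connection conn, long oid, OutputStream out) throws Exception {
    conn.setAutoCommit(false); // large objects must be accessed inside a transaction
    LargeObjectManager lom = conn.unwrap(PGConnection.class).getLargeObjectAPI();
    LargeObject lo = lom.open(oid, LargeObjectManager.READ);
    try (InputStream in = lo.getInputStream()) {
        IOUtils.copyStream(in, out); // bytes are pulled from the server as they are read
    } finally {
        lo.close();
    }
    conn.commit();
}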


Conclusion

Reading all at once is a bit faster (~1 second gained every 30 seconds of execution), but it could seriously risk making the big query time out, not to mention causing RAM exhaustion and, potentially, max-cursor hits.

Do not try this at home; just stick with one query at a time...

Andrea Ligios