How to implement several threads in Java for downloading a single table data?

Question

How can I use several threads, with multiple connections or a single shared one, so that the data of a single large table can be downloaded quickly?

In my application I am downloading a table with 12 lacs (1 lac = 100,000) records, which takes at least 4 hours at normal connection speed and longer on a slow connection.

So I need several threads in Java to download a single table's data using multiple connection objects or the same one, but I have no idea how to do this.

How do I position a record pointer in each thread, and how do I then combine the records from all threads into a single large file?

Thanks in Advance

As far as I know, Download Accelerator Plus (DAP) speeds up a file download by opening parallel downloads. Is this technique possible in my case? — Kishore_2021, Nov 30 '11 at 12:29
Your question isn't very clear. What is this table you're downloading – is it a file on a web server, or a table in a database? How are you downloading it? — millimoose, Nov 30 '11 at 14:32
Web download accelerators work by using a very specific HTTP feature that enables requesting part of a file. There isn't any generic method to do a partial transfer over any internet protocol. — millimoose, Nov 30 '11 at 14:34
Your question isn't very clear. Where are you trying to download a file? What is your client? What is the server you are trying to download from? Is your download a static file or dynamically generated data? — Drona, Nov 30 '11 at 16:19
I am downloading the records of an AS400 database table over a JDBC connection. The table is on a server and has a very large number of records, so the download takes a long time. I now need several threads in Java to download this single large table's data with multiple connections or a shared one. How is this possible in Java? — Kishore_2021, Dec 01 '11 at 06:51

Drona · Accepted Answer · 2011-12-02T06:34:56.383

First of all, it is not advisable to fetch and download such a huge amount of data onto the client. If you need the data for display purposes, then you don't need more records than fit on your screen. You can paginate the data and fetch one page at a time. If you fetch it all and process it in memory, you will very likely run out of memory on your client.

If you need to do this regardless, you can spawn multiple threads with separate connections to the database, where each thread pulls a fraction of the data (one or more pages). If you have, say, 100K records and 100 threads available, then each thread can pull 1K records. This is just an example: it is not advisable to have 100 threads with 100 open connections to the DB. Limit the number of threads to some optimal value, and also limit the number of records each thread pulls. You can limit the number of records pulled from the DB on the basis of rownum.
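
A minimal sketch of the bounded-pool idea above, not a definitive implementation. `RangeFetcher` is a hypothetical callback standing in for "open your own connection and select rows `firstRow..lastRow`" (for example by row number); the class and method names are illustrative assumptions too. Chunks are fetched in parallel but handed to `sink` in row order, which is one way to end up with a single output file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

class ParallelRangeDownload {

    /** Hypothetical callback: fetch rows firstRow..lastRow (1-based, inclusive) on its own connection. */
    interface RangeFetcher<T> {
        List<T> fetch(long firstRow, long lastRow) throws Exception;
    }

    /**
     * Splits totalRows into chunks, fetches them on a fixed-size pool and passes
     * each chunk to sink in row order. Returns the number of rows delivered.
     */
    static <T> long download(long totalRows, int rowsPerChunk, int threads,
                             RangeFetcher<T> fetcher, Consumer<List<T>> sink) throws Exception {
        if (rowsPerChunk <= 0 || threads <= 0) {
            throw new IllegalArgumentException("rowsPerChunk and threads must be positive");
        }
        ExecutorService pool = Executors.newFixedThreadPool(threads); // caps concurrent connections
        try {
            List<Future<List<T>>> pending = new ArrayList<>();
            for (long first = 1; first <= totalRows; first += rowsPerChunk) {
                final long from = first;
                final long to = Math.min(first + rowsPerChunk - 1, totalRows);
                Callable<List<T>> task = () -> fetcher.fetch(from, to);
                pending.add(pool.submit(task));
            }
            long delivered = 0;
            // Futures are consumed in submission order, so the sink sees rows in order
            // even though the chunks may finish out of order.
            for (Future<List<T>> chunk : pending) {
                List<T> rows = chunk.get(); // rethrows a fetch failure as ExecutionException
                sink.accept(rows);          // e.g. append the rows to the output file
                delivered += rows.size();
            }
            return delivered;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Because `sink` runs only on the calling thread, it can append to one file without extra locking. The trade-off is that chunks which finish early stay in memory until the chunks before them have been written, so the chunk size should stay modest.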

Thanks Vikas. Actually my application is an Eclipse plug-in, in which the user can download all the online data at once into a DB2 Express-C database with a small RCP tool, and can then view it at any time through the plug-in without going online again. But I have no idea how to use ROWID with multiple threads. Please give some hints on that. — Kishore_2021, Dec 02 '11 at 06:02
Refer to the following link for more information on implementing pagination. http://www.decipherinfosys.com/Paging_Data.pdf — Drona, Dec 02 '11 at 06:31

score 2 · Answer 2 · edited May 23 '17 at 11:45

It seems that there are multiple ways to "multi thread read from a full table."

Zeroth way: if your problem is just "I run out of RAM reading that whole table into memory" then you could try processing one row at a time somehow (or a batch of rows), then process the next batch, etc. Thus avoiding loading an entire table into memory (but still single thread so possibly slow).
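
A rough sketch of the "batch of rows at a time" idea, assuming plain JDBC. `BatchedTableReader` and `Batcher` are hypothetical names, and the rows are kept as `Object[]` only for illustration. `Statement.setFetchSize` is a hint that the driver may ignore.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

class BatchedTableReader {

    /** Collects rows and hands them to the handler batchSize rows at a time. */
    static final class Batcher {
        private final int batchSize;
        private final Consumer<List<Object[]>> handler;
        private List<Object[]> current = new ArrayList<>();
        private long total;

        Batcher(int batchSize, Consumer<List<Object[]>> handler) {
            if (batchSize <= 0) throw new IllegalArgumentException("batchSize must be positive");
            this.batchSize = batchSize;
            this.handler = handler;
        }

        void add(Object[] row) {
            current.add(row);
            total++;
            if (current.size() == batchSize) flush();
        }

        /** Delivers any buffered rows; call once after the last row. */
        void flush() {
            if (!current.isEmpty()) {
                handler.accept(current);
                current = new ArrayList<>(); // start a fresh batch, the old one belongs to the handler
            }
        }

        long total() {
            return total;
        }
    }

    /** Streams the query result through the handler without holding the whole table in memory. */
    static long stream(Connection conn, String sql, int batchSize,
                       Consumer<List<Object[]>> handler) throws SQLException {
        Batcher batcher = new Batcher(batchSize, handler);
        try (Statement st = conn.createStatement()) {
            st.setFetchSize(batchSize); // hint only: the driver decides how many rows to prefetch
            try (ResultSet rs = st.executeQuery(sql)) {
                int columns = rs.getMetaData().getColumnCount();
                while (rs.next()) {
                    Object[] row = new Object[columns];
                    for (int i = 0; i < columns; i++) {
                        row[i] = rs.getObject(i + 1); // JDBC columns are 1-based
                    }
                    batcher.add(row);
                }
            }
        }
        batcher.flush();
        return batcher.total();
    }
}
```

Only one batch is alive at a time, so memory use is bounded by the batch size rather than by the table size.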

First way: have a single thread query the entire table, putting individual rows onto a queue that feeds multiple worker threads [NB that setting fetch size for your JDBC connection might be helpful here if you want this first thread to go as fast as possible]. Drawback: only one thread is querying the initial DB at a time, which may not "max out" your DB itself. Pro: you're not re-running queries so sort order shouldn't change on you half way through (for instance if your query is select * from table_name, the return order is somewhat random, but if you return it all from the same resultset/query, you won't get duplicates). You won't have accidental duplicates or anything like that. Here's a tutorial doing it this way.
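
The single-reader approach could look roughly like the sketch below. `QueueFanOut` is a hypothetical name, and the `Iterator` stands in for the one thread that walks the JDBC result set. A bounded queue keeps the reader from running far ahead of the workers, and a "poison pill" object tells each worker to stop.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

class QueueFanOut {

    /** Sentinel placed on the queue once per worker to signal "no more rows". */
    private static final Object POISON = new Object();

    /**
     * Reads rows on the calling thread and feeds them to worker threads through
     * a bounded queue. Rows must be non-null; handler must be thread-safe.
     */
    static <T> void run(Iterator<T> rows, int workers, int capacity, Consumer<T> handler)
            throws InterruptedException {
        BlockingQueue<Object> queue = new ArrayBlockingQueue<>(capacity);
        List<Thread> threads = new ArrayList<>();
        for (int i = 0; i < workers; i++) {
            Thread t = new Thread(() -> {
                try {
                    while (true) {
                        Object item = queue.take(); // blocks until a row or the sentinel arrives
                        if (item == POISON) {
                            return;
                        }
                        @SuppressWarnings("unchecked")
                        T row = (T) item;
                        handler.accept(row);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
            t.start();
            threads.add(t);
        }
        while (rows.hasNext()) {
            queue.put(rows.next()); // blocks when the queue is full (back-pressure)
        }
        for (int i = 0; i < workers; i++) {
            queue.put(POISON); // one sentinel per worker
        }
        for (Thread t : threads) {
            t.join();
        }
    }
}
```

This sketch does not handle a worker that dies from an exception in `handler`; if every worker died, the reader would block on a full queue. A real implementation would need to catch and report such failures.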

Second way: pagination, basically every thread somehow knows what chunk it should select (XXX in this example), so it knows "I should query the table like select * from table_name order by something start with XXX limit 10". Then each thread basically processes (in this instance) 10 at a time [XXX is a shared variable among threads incremented by the calling thread].

The problem is the "order by something": it means that for each query the DB has to order the entire table, which may or may not be possible, and can be expensive, especially near the end of a table. If the column is indexed this should not be a problem. The caveat here is that if there are "gaps" in the data, you'll be doing some useless queries, but they'll probably still be fast. If you have an ID column and it's mostly contiguous, you might be able to "chunk" based on ID, for instance.
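
One possible shape for the shared "XXX" counter, as a sketch under assumptions. `PageFetcher` is a hypothetical callback that would run the ordered, paged query (in whatever paging syntax your database supports) and return at most `limit` rows; `SharedOffsetPager` is likewise an illustrative name. An `AtomicLong` lets each thread claim the next page without two threads ever getting the same offset.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.Consumer;

class SharedOffsetPager {

    /** Hypothetical callback: return up to limit rows starting at offset, in a stable order. */
    interface PageFetcher<T> {
        List<T> fetch(long offset, int limit) throws Exception;
    }

    /** Reads every page using the given number of threads; returns the number of rows seen. */
    static <T> long readAll(PageFetcher<T> fetcher, int pageSize, int threads,
                            Consumer<List<T>> handler) throws Exception {
        AtomicLong nextOffset = new AtomicLong(); // the shared "XXX" from the text
        AtomicLong rowsSeen = new AtomicLong();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Void>> workers = new ArrayList<>();
            for (int i = 0; i < threads; i++) {
                Callable<Void> worker = () -> {
                    while (true) {
                        long offset = nextOffset.getAndAdd(pageSize); // atomically claim a page
                        List<T> page = fetcher.fetch(offset, pageSize);
                        if (!page.isEmpty()) {
                            handler.accept(page); // called concurrently: must be thread-safe
                            rowsSeen.addAndGet(page.size());
                        }
                        if (page.size() < pageSize) {
                            return null; // a short page means this worker reached the end
                        }
                    }
                };
                workers.add(pool.submit(worker));
            }
            for (Future<Void> w : workers) {
                w.get(); // rethrows any worker failure
            }
            return rowsSeen.get();
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Offsets are claimed in increasing order, so by the time any worker sees a short page every earlier page has already been claimed by some worker. The result is only complete and duplicate-free if the query order is stable and the table does not change during the run.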

If you have some other column that you can key off of, for instance a date column with a known "quantity" per date, and it is indexed, then you may be able to avoid the "order by" by instead chunking by date, for example select * from table_name where date < XXX and date > YYY (also no limit clause, though you could have a thread use limit clauses to work through a particular unique date range, updating as it goes or sorting and chunking since it's a smaller range, less pain).
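
As an illustration of chunking by date, the sketch below cuts a date range into slices that threads could each query. `DateSlices` and `Slice` are hypothetical names. It assumes half-open ranges (`date >= from and date < to`) rather than the strict `<` and `>` shown above, so that rows falling exactly on a boundary belong to exactly one slice.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

class DateSlices {

    /** One half-open range: fromInclusive <= date < toExclusive. */
    static final class Slice {
        final LocalDate fromInclusive;
        final LocalDate toExclusive;

        Slice(LocalDate fromInclusive, LocalDate toExclusive) {
            this.fromInclusive = fromInclusive;
            this.toExclusive = toExclusive;
        }
    }

    /** Cuts [from, to) into consecutive slices of at most daysPerSlice days each. */
    static List<Slice> split(LocalDate from, LocalDate to, int daysPerSlice) {
        if (daysPerSlice <= 0) {
            throw new IllegalArgumentException("daysPerSlice must be positive");
        }
        List<Slice> slices = new ArrayList<>();
        LocalDate start = from;
        while (start.isBefore(to)) {
            LocalDate end = start.plusDays(daysPerSlice);
            if (end.isAfter(to)) {
                end = to; // the last slice may be shorter
            }
            slices.add(new Slice(start, end));
            start = end; // the next slice begins exactly where this one ended
        }
        return slices;
    }
}
```

Each worker could then bind its slice to a prepared statement such as `select * from table_name where date >= ? and date < ?`, converting the bounds with `java.sql.Date.valueOf(LocalDate)`.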

Third way: you execute a query to "reserve" rows from the table, like update table_name set lock_column = my_thread_unique_key where lock_column is null limit 10 followed by a query select * from table_name where lock_column = my_thread_unique_key. Disadvantage: are you sure your database executes this as one atomic operation? If not, then it's possible two setter queries will collide or something like that, causing duplicates or partial batches. Be careful. Maybe synchronize your process around the "select and update" queries, or lock the table and/or rows appropriately, to avoid possible collisions (postgres for instance requires special SERIALIZABLE option).

Fourth way: (related to third) mostly useful if you have large gaps and want to avoid "useless" queries: create a new table that "numbers" your initial table, with an incrementing ID [basically a temp table]. Then you can divide that table up by chunks of contiguous ID's and use it to reference the rows in the first. Or if you have a column already in the table (or can add one) to use just for batching purposes, you may be able to assign batch ID's to rows, like update table_name set batch_number = rownum % 20000 then each row has a batch number assigned to itself, threads can be assigned batches (or assigned "every 9th batch" or what not). Or similarly update table_name set row_counter_column=rownum (Oracle examples, but you get the drift). Then you'd have a contiguous set of numbers to batch off of.

Fifth way: (not sure if I really recommend this, but) assign each row a "random" float at insert time. Then, given that you know the approximate size of the table, you can peel off a fraction of it: for instance, if you want 100 batches, one batch is "where x >= 0.01 and x < 0.02", and so on. (Idea inspired by how wikipedia is able to get a "random" page: it assigns each row a random float at insert time.)

The thing you really want to avoid is some kind of change in sort order half way through. For instance if you don't specify a sort order, and just query like this select * from table_name start by XXX limit 10 from multiple threads, it's conceivably possible that the database will [since there is no sort element specified] change the order it returns you rows half way through [for instance, if new data is added] meaning you may skip rows or what not.

Using Hibernate's ScrollableResults to slowly read 90 million records also has some related ideas (esp. for hibernate users).

Another option is if you know some column (like "id") is mostly contiguous, you can just iterate through that "by chunks" (get the max, then iterate numerically over chunks). Or some other column that is "chunkable" as it were.

score 2 · Answer 3 · answered Dec 02 '11 at 04:06

As Vikas pointed out, if you're downloading gigabytes of data to the client side, you're doing something really, really wrong; as he said, you should never need to download more records than can fit on your screen. If, however, you only need to do this occasionally for database duplication or backup purposes, just use the database export functionality of your DBMS and download the exported file using DAP (or your favorite download accelerator).

Lie Ryan


I agree. If you need to replicate/backup the data to a client side database, you should be using the export functionality. Export the data on the server side, download the dump file, and import it into the client DB. – Drona Dec 02 '11 at 06:33

score 0 · Answer 4 · answered May 09 '15 at 20:02

I just felt compelled to answer on this old posting.

Note that this is a typical scenario for Big Data: not only acquiring the data in multiple threads, but also processing that data further in multiple threads. Such approaches do not always call for all data to be accumulated in memory; the data can be processed in groups and/or sliding windows, and you only need to either accumulate a result or pass the data further on (to other permanent storage).

To process the data in parallel, typically a partitioning scheme or a splitting scheme is applied to the source data. If the data is raw text, this could be a cut of arbitrary size somewhere in the middle. For databases, the partitioning scheme is nothing but an extra where condition applied to your query to allow paging. This could be something like:

Driver Program: Split my data in four parts, and start 4 workers
4 x (Worker Program): Give me part 1..4 of 4 of the data

This could translate into a (pseudo) sql like:

SELECT ...
FROM (... Subquery ...)
WHERE date = SYSDATE - days(:partition)

In the end it is all pretty conventional, nothing super advanced.

I like this, or split the table up into 4 "known size" chunks and work on those in threads... — rogerdpack, Sep 15 '17 at 16:42
