It seems that there are multiple ways to "multi thread read from a full table."
Zeroth way: if your problem is just "I run out of RAM reading that whole table into memory" then you could try processing one row at a time somehow (or a batch of rows), then process the next batch, etc. Thus avoiding loading an entire table into memory (but still single thread so possibly slow).
First way: have a single thread query the entire table, putting individual rows onto a queue that feeds multiple worker threads [NB that setting fetch size for your JDBC connection might be helpful here if you want this first thread to go as fast as possible]. Drawback: only one thread is querying the initial DB at a time, which may not "max out" your DB itself. Pro: you're not re-running queries so sort order shouldn't change on you half way through (for instance if your query is select * from table_name, the return order is somewhat random, but if you return it all from the same resultset/query, you won't get duplicates). You won't have accidental duplicates or anything like that. Here's a tutorial doing it this way.
Second way: pagination, basically every thread somehow knows what chunk it should select (XXX
in this example), so it knows "I should query the table like select * from table_name order by something start with XXX limit 10
". Then each thread basically processes (in this instance) 10 at a time [XXX is a shared variable among threads incremented by the calling thread].
The problem is the "order by something" it means that for each query the DB has to order the entire table, which may or may not be possible, and can be expensive especially near the end of a table. If it's indexed this should not be a problem. The caveat here is that if there are "gaps" in the data, you'll be doing some useless queries, but they'll probably still be fast. If you have an ID column and it's mostly contiguous, you might be able to "chunk" based on ID, for instance.
If you have some other column that you can key off of, for instance a date column with a known "quantity" per date, and it is indexed, then you may be able to avoid the "order by" by instead chunking by date, for example select * from table_name where date < XXX and date > YYY
(also no limit clause, though you could have a thread use limit clauses to work through a particular unique date range, updating as it goes or sorting and chunking since it's a smaller range, less pain).
Third way: you execute a query to "reserve" rows from the table, like update table_name set lock_column = my_thread_unique_key where column is nil limit 10
followed by a query select * from table_name where lock_column = my_thread_unique_key
. Disadvantage: are you sure your database executes this as one atomic operation? If not then it's possible two setter queries will collide or something like that, causing duplicates or partial batches. Be careful. Maybe synchronize your process around the "select and update" queries or lock the table and/or rows appropriately. Something like that to avoid possible collision (postgres for instance requires special SERIALIZABLE option).
Fourth way: (related to third) mostly useful if you have large gaps and want to avoid "useless" queries: create a new table that "numbers" your initial table, with an incrementing ID [basically a temp table]. Then you can divide that table up by chunks of contiguous ID's and use it to reference the rows in the first. Or if you have a column already in the table (or can add one) to use just for batching purposes, you may be able to assign batch ID's to rows, like update table_name set batch_number = rownum % 20000
then each row has a batch number assigned to itself, threads can be assigned batches (or assigned "every 9th batch" or what not). Or similarly update table_name set row_counter_column=rownum
(Oracle examples, but you get the drift). Then you'd have a contiguous set of numbers to batch off of.
Fifth way: (not sure if I really recommend this, but) assign each row a "random" float at insert time. Then given you know the approximate size of the database, you can peel off a fraction of it like, if 100 and you want 100 batches "where x < 0.01 and X >= 0.02" or the like. (Idea inspired by how wikipedia is able to get a "random" page--assigns each row a random float at insert time).
The thing you really want to avoid is some kind of change in sort order half way through. For instance if you don't specify a sort order, and just query like this select * from table_name start by XXX limit 10
from multiple threads, it's conceivably possible that the database will [since there is no sort element specified] change the order it returns you rows half way through [for instance, if new data is added] meaning you may skip rows or what not.
Using Hibernate's ScrollableResults to slowly read 90 million records also has some related ideas (esp. for hibernate users).
Another option is if you know some column (like "id") is mostly contiguous, you can just iterate through that "by chunks" (get the max, then iterate numerically over chunks). Or some other column that is "chunkable" as it were.