Difficult to advise without close knowledge of the system, so all I can offer is a general principle, based on approaches that I have successfully used. Warning: it's old school.
I've used pseudo-code in the code examples below. Hope that doesn't break any house rules.
With a billion rows, I would guess that the table has a unique/highly selective index.
Assuming the table has a column called ID with a unique index, I would do the following:
SELECT MIN(ID) minid, MAX(ID) maxid, COUNT(1) numid FROM the_Table;
This gives us basic high level info to plan our attack.
Next, how many processes can you run against this? Say 7?
Then, assuming fairly even distribution:
thread_size = numid / processes
If thread_size is frighteningly large, split it further into something more manageable, say 100k rows per batch. The aim is to get the index used in the SELECT.
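As a rough sketch of that arithmetic (made-up numbers; plug in the values returned by the planning query):

public class PlanSplit {
    public static void main(String[] args) {
        // Illustrative values only: take numid from the planning query above.
        long numid = 1_000_000_000L;                      // COUNT(1)
        int processes = 7;
        long threadSize = numid / processes;              // rows per worker, assuming even spread
        long batchSize = 100_000L;                        // keep each SELECT small and index-friendly
        long batchesPerThread = (threadSize + batchSize - 1) / batchSize;
        System.out.println(threadSize + " rows per worker, ~" + batchesPerThread + " batches each");
    }
}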
With that, write the process as follows:
process_batch ( batchStartID, batchEndID )
{
    SELECT * FROM the_table WHERE id BETWEEN batchStartID AND batchEndID ;
    do_row_by_row_processing ( rowdata );
    write_progress_info_every_1000_rows();
}
process_thread ( startID, endID, batchSize )
{
    theID = startID;
    while ( theID <= endID )
    {
        process_batch ( theID, MIN(theID + batchSize, endID) );
        write_thread_progress_info(theID, startID, endID);
        theID = theID + batchSize + 1;
    }
}
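If the extract app is written in Java (as hinted at further down), a minimal JDBC sketch of those two routines could look like the following. The class name, connection handling and the the_table/ID names are assumptions for illustration, not a definitive implementation:

import java.sql.*;

// Sketch only: adapt table/column names, error handling and the row handling to your system.
public class BatchExtract {

    // Process one small, index-friendly range of IDs.
    static void processBatch(Connection con, long batchStartID, long batchEndID) throws SQLException {
        String sql = "SELECT * FROM the_table WHERE id BETWEEN ? AND ?";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setLong(1, batchStartID);
            ps.setLong(2, batchEndID);
            try (ResultSet rs = ps.executeQuery()) {
                long count = 0;
                while (rs.next()) {
                    // row-by-row processing goes here, e.g. format the row and append it to a file
                    if (++count % 1000 == 0) {
                        System.out.println("batch " + batchStartID + "-" + batchEndID + ": " + count + " rows");
                    }
                }
            }
        }
    }

    // Walk one worker's range in batches of batchSize.
    static void processThread(Connection con, long startID, long endID, long batchSize) throws SQLException {
        long theID = startID;
        while (theID <= endID) {
            processBatch(con, theID, Math.min(theID + batchSize, endID));
            System.out.println("range " + startID + "-" + endID + ": processed up to " + Math.min(theID + batchSize, endID));
            theID = theID + batchSize + 1;
        }
    }
}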
Each process_thread instance is given a unique range and a batch size. It processes the (endID - startID + 1) rows in its range in batches of batchSize.
process_batch will limit its processing to a relatively small number of rows, utilising the index. The returned results will be written to your files.
You then start your several process_threads with appropriately even ranges.
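For instance, with a fixed thread pool and one connection per worker (again just a sketch; the JDBC URL, the ID values and the BatchExtract class from the earlier snippet are placeholders):

import java.sql.*;
import java.util.concurrent.*;

public class RunExtract {
    public static void main(String[] args) throws Exception {
        // Illustrative values: take minid/maxid from the planning query.
        long minid = 1L, maxid = 1_000_000_000L, batchSize = 100_000L;
        int processes = 7;
        long span = (maxid - minid + processes) / processes;     // ceiling division over the ID range
        ExecutorService pool = Executors.newFixedThreadPool(processes);
        for (int p = 0; p < processes; p++) {
            final long startID = minid + p * span;
            final long endID = Math.min(startID + span - 1, maxid);
            pool.submit(() -> {
                // One connection per worker; a failed range can simply be re-run on its own.
                try (Connection con = DriverManager.getConnection("jdbc:your_db://host/db", "user", "pass")) {
                    BatchExtract.processThread(con, startID, endID, batchSize);
                } catch (SQLException e) {
                    e.printStackTrace();
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
    }
}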
A benefit of this approach is that if there is a failure anywhere you can restart just within the failed range.
You also get approximate progress markers on each iteration.
If using a language like Java in the extract app, I would recommend setting the fetch size to a sensible value (e.g. 512) to reduce round-tripping to the database.
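With plain JDBC that is a single hint on the statement (a sketch; 512 is only a reasonable starting point and worth tuning for your driver):

import java.sql.*;

class FetchSizeHint {
    // Hint to the driver to fetch rows from the server in chunks of 512 per round trip.
    static PreparedStatement prepareBatchSelect(Connection con) throws SQLException {
        PreparedStatement ps = con.prepareStatement(
                "SELECT * FROM the_table WHERE id BETWEEN ? AND ?");
        ps.setFetchSize(512);
        return ps;
    }
}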