
I have a really huge table with billions of records, and I need to read all the data from it. It is desirable to use several threads to fetch the data over a network connection. Since this operation takes time, it would also be nice if the fetching application could display some kind of progress bar.

I found that the standard approach is not effective, because it takes Oracle a long time to find the beginning of the page. The following is an example of a slow query:

SELECT col1
FROM (SELECT rownum AS r, col1 FROM table1 WHERE rownum < 20010000)
WHERE r > 20000000

Is there any faster way to fetch data from such a table?

Anthony
  • why do you need to read all billion rows? you as a person do not, and could not...so i guess what i'm asking is, what are you going to really do with all that data? – thatjeffsmith Jan 03 '19 at 18:29
  • I need to transform it and load into another database – Anthony Jan 03 '19 at 18:33
  • how are you going to transform it, and are you going to put it into another oracle database? – thatjeffsmith Jan 03 '19 at 18:41
  • The application takes care of the transformation, and the data is loaded into a non-Oracle database. – Anthony Jan 03 '19 at 18:42
  • i was gonna say do the transformations in plsql in the db - will prob be faster there - and then you could use utl file to write delimited text files for the data to the server file system – thatjeffsmith Jan 03 '19 at 18:56
  • i'd also say going forward, store this data in oracle as an external table - a file(s) on disk, that way there's no 'extracting it' - you already got it there. and of course as soon as you get this data out, it will probably change – thatjeffsmith Jan 03 '19 at 19:06

1 Answer


Difficult to advise without close knowledge of the system, so all I can offer is a general principle, based on approaches that I have used successfully. Warning: it's old school.

I've used pseudo-code in the code examples below. Hope that doesn't break any house rules.

With a billion rows, I would guess that the table has a unique or highly selective index.

Assuming the table has a column called ID with a unique index, I would do the following:

SELECT MIN(ID) minid, MAX(ID) maxid, COUNT(1) numid FROM the_Table;

This gives us basic high level info to plan our attack.
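
For illustration, a minimal JDBC sketch of that planning query might look like this (the_table and id are placeholders for your own table and indexed column; connection handling is assumed to exist elsewhere in your app):

import java.sql.*;

// Sketch only: gather the planning figures once, up front.
public class RangePlanner
{
   // Returns { minid, maxid, numid } for the table being extracted.
   public static long[] planningInfo ( Connection con ) throws SQLException
   {
      try ( Statement stmt = con.createStatement();
            ResultSet rs = stmt.executeQuery(
               "SELECT MIN(id) AS minid, MAX(id) AS maxid, COUNT(1) AS numid FROM the_table") )
      {
         rs.next();
         return new long[] { rs.getLong("minid"), rs.getLong("maxid"), rs.getLong("numid") };
      }
   }
}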

Next, how many processes can you run against this? Say, 7?

Then, assuming fairly even distribution:

thread_size = numid / processes

If thread_size is frighteningly large, split it further into manageable chunks of, say, 100k rows. The aim is to get the index used in the SELECT.
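
As a rough illustration only (assuming the IDs are dense enough that equal ID spans give roughly equal row counts), the per-thread ranges could be derived like this; the processes count is an assumption to tune:

// Sketch: derive per-worker ID ranges from the planning figures.
static long[][] splitRanges ( long minId, long maxId, int processes )
{
   long width = ( maxId - minId + 1 ) / processes;   // ID span per worker
   long[][] ranges = new long[processes][2];
   for ( int p = 0; p < processes; p++ )
   {
      ranges[p][0] = minId + (long) p * width;
      ranges[p][1] = ( p == processes - 1 ) ? maxId : ranges[p][0] + width - 1;
   }
   return ranges;
}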

With that, write the process as follows:

process_batch ( batchStartID, batchEndID )
{
   SELECT * FROM the_table WHERE id BETWEEN batchStartID AND batchEndID;
   for each fetched rowdata:
      do_row_by_row_processing ( rowdata );
      write_progress_info_every_1000_rows();
}

process_thread ( startID, endID, batchSize )
{
   theID = startID;
   while ( theID < endID )
   {
      process_batch ( theID, theID+batchSize);
      write_thread_progress_info(theID, startID, endID);
      theID = theID + batchSize + 1;
   }
}
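
If the extract app is in Java, a hedged JDBC translation of that pseudo-code could look roughly like the sketch below. The table name, column name, and the processRow/reportProgress hooks are all placeholders for your own transform-and-load and progress-bar code:

import java.sql.*;

// Sketch of process_thread/process_batch as a runnable worker.
// Each worker opens its own connection and owns one ID range.
public class RangeWorker implements Runnable
{
   private final String url, user, password;
   private final long startId, endId, batchSize;

   public RangeWorker ( String url, String user, String password,
                        long startId, long endId, long batchSize )
   {
      this.url = url; this.user = user; this.password = password;
      this.startId = startId; this.endId = endId; this.batchSize = batchSize;
   }

   @Override
   public void run ()
   {
      try ( Connection con = DriverManager.getConnection(url, user, password);
            PreparedStatement ps = con.prepareStatement(
               "SELECT * FROM the_table WHERE id BETWEEN ? AND ?") )
      {
         ps.setFetchSize(512);                               // fewer round trips
         long current = startId;
         while ( current <= endId )
         {
            long batchEnd = Math.min(current + batchSize, endId);
            ps.setLong(1, current);
            ps.setLong(2, batchEnd);
            try ( ResultSet rs = ps.executeQuery() )
            {
               long n = 0;
               while ( rs.next() )
               {
                  processRow(rs);                            // your transform + write step
                  if ( ++n % 1000 == 0 ) reportProgress(current, n);
               }
            }
            current = batchEnd + 1;                          // next batch, no overlap
         }
      }
      catch ( SQLException e )
      {
         throw new RuntimeException(e);
      }
   }

   private void processRow ( ResultSet rs ) throws SQLException { /* transform and load */ }
   private void reportProgress ( long batchStart, long rowsInBatch ) { /* progress bar hook */ }
}

Each worker gets its own connection, so several can run in parallel on their own threads.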

Each process_thread instance is given a unique range and a batch size.

It processes (endID-startID) rows in batches of batchSize.

process_batch will limit its processing to a relatively small number of rows, utilising the index. The returned results will be written to your files.

You then start your several process_thread instances with appropriately even ranges.

A benefit of this approach is that if there is a failure anywhere you can restart just within the failed range.

You also get approximate progress markers on each iteration.

If using a language like Java in the extract app, I would recommend setting the fetch size to a sensible value (e.g. 512) to reduce round-tripping to the database.
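
In JDBC, for instance, that is a single call on the statement (512 here is just the sensible value mentioned above):

ps.setFetchSize(512);   // fetch 512 rows per network round trip instead of the driver default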

TenG