
I have a simple concurrency script that uses Oracle as the main database. I'm querying it with python-oracledb, with each thread opening its own database connection. However, I have noticed that this is very slow for larger tables. For example, for a table with 1.2 million rows, the program hangs and my IDE crashes, whereas with 80,000 rows it completes in a few seconds. What is causing this overflow, and how can I correct it?

import threading

import oracledb

SQL = 'SELECT /*+ ENABLE_PARALLEL_DML PARALLEL(AUTO) */ * FROM USER_TABLE offset :rowoffset rows fetch next :maxrows rows only'

MAX_ROWS = 1200000
NUM_THREADS = 12
def start_workload(fn):
    def wrapped(self, threads, *args, **kwargs):
        assert isinstance(threads, int)
        assert threads > 0

        ts = []
        for i in range(threads):
            new_args = (self, i, *args)
            t = threading.Thread(target=fn, args=new_args, kwargs=kwargs)
            t.start()
            ts.append(t)
        for t in ts:
            t.join()
    return wrapped

import pandas as pd
class TEST:
    def __init__(self, batchsize, maxrows, *args):
        self._pool = oracledb.create_pool(
            user=args[0], password=args[1],
            host="localhost", port=1521, service_name="service_name",
            min=NUM_THREADS, max=NUM_THREADS)
        self._batchsize = batchsize
        self._maxrows = maxrows
        
    @start_workload
    def do_query(self, tn):
        with self._pool.acquire() as connection:
            with connection.cursor() as cursor:
                max_rows = self._maxrows
                row_iter = int(max_rows/self._batchsize)
                cursor.arraysize = 10000
                cursor.prefetchrows = 1000000
                cursor.execute(SQL, dict(rowoffset=(tn*row_iter), maxrows=row_iter))
                columns = [col[0] for col in cursor.description]
                cursor.rowfactory = lambda *args: dict(zip(columns, args))
                pd.DataFrame(cursor.fetchall()).to_csv(f'TH_{tn}_customer.csv')

if __name__ == '__main__':
    result = TEST(NUM_THREADS,MAX_ROWS,username, password)
    import time
    start=time.time()
    Make = result.do_query(NUM_THREADS)
    end=time.time()
    print('Total Time: %s' % (end-start))
    print(Make)
  • `I have a simple concurrency script` there's no such thing. You can't make bad access code go faster by using more threads, you'll do the exact opposite. You'll end up with exponentially worse performance as multiple connections start blocking each other. In this case, multiple threads end up competing for the same server CPUs, same RAM and same disk IO and, worse, everything is controlled through slow client-side code – Panagiotis Kanavos Mar 06 '23 at 08:12
  • What are you trying to do in the first place? The only relevant part seems to be `to_csv`. Are you trying to export data to CSV files? 80K is no data at all, it easily fits into the CPU's cache. 1.2M rows is *some* data, not a lot and definitely not big data. Oracle, like all databases, has native tools to import or export data to text files – Panagiotis Kanavos Mar 06 '23 at 08:15
  • @PanagiotisKanavos I'm only attempting to read Oracle tables with Python and output them in chunks to a .csv file; I am opting for this path because I may want to export a queried table. However, I find it's far too time-consuming for 1.2m rows. – Emil11 Mar 06 '23 at 08:16
  • Read to do *what*? You don't read tables for the sake of reading. Searching a 1.2M row table to return 1 or 100 rows is fast if the query is well written and the table has the proper indexes. The only reason to read all rows is either a bad query or exporting. In most cases you can write a SQL query to return the sums, averages or subset of information you want – Panagiotis Kanavos Mar 06 '23 at 08:17
  • @PanagiotisKanavos You're misunderstanding me, I never explained the purpose of reading these tables. That's something else entirely; my question is about the speed of the current script. It's currently reading 120k rows at a time, and a 120k row table reads in under 20 seconds with my script. However, this one has been hanging for more than 5 minutes. – Emil11 Mar 06 '23 at 08:20
  • Yes, it's slow because what you're doing is inherently slow, the slowest way possible of doing anything with databases. So what are you actually trying to do? `I never explained the purpose to read these tables.` precisely. What you're doing would never work. – Panagiotis Kanavos Mar 06 '23 at 08:22
  • `You're misunderstanding me,` on the contrary, I'm working with some 40M row tables right now. There are several SO questions from people that tried to accelerate their slow queries or ETL jobs by running more of them, resulting in far worse performance. In this case, *paging* results in incrementally worse performance for every batch. When you use `OFFSET 30000 FETCH NEXT 1000` the server *must* find the first 30000 rows so it can skip them and get the next 1000. By running 12 such queries concurrently, you're executing the same query 12 times, scanning the same table 12 times – Panagiotis Kanavos Mar 06 '23 at 08:24
  • @PanagiotisKanavos that would explain why it's extremely slow, and I was worried that's what was happening. What do you propose to improve my script, or would concurrency not be possible with Python? – Emil11 Mar 06 '23 at 08:28
  • Impossible to answer without knowing what you're trying to do. It looks like you're trying to export 1M rows. Why aren't you using Oracle's tools? Oracle can even treat CSVs as external tables, in which case you could `INSERT` into that CSV the results of a query. Check [this Ask TOM question](https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:9536328100346697722) where the poster asks how to export 100M rows, not just 1M. – Panagiotis Kanavos Mar 06 '23 at 08:32
  • You can't improve performance if you don't know what's actually slow, either. If you try to export data using a client-side script, it's quite likely the bottleneck is the network. Running more queries won't improve network IO. Then, using Pandas is inherently slow - Pandas will read everything into memory before writing it to disk. That's a waste of both RAM and time, since all you really want is to write those rows to disk as they arrive. – Panagiotis Kanavos Mar 06 '23 at 08:34
  • If the query is complex and slow (which `SELECT * FROM SomeTable` isn't) it may make sense to calculate just the ID/PK values of the results, store those in a temporary table and perform batching on that table. Then again, [as this SO question shows](https://stackoverflow.com/questions/4613465/using-python-to-write-mysql-query-to-csv-need-to-show-field-names) you may be able to use `rows = cursor.fetchall(); fp = open('/tmp/file.csv', 'w'); myFile = csv.writer(fp); myFile.writerows(rows); fp.close()` – Panagiotis Kanavos Mar 06 '23 at 08:38
  • [Selecting from an Oracle Database table in parallel using Python](https://medium.com/oracledevs/selecting-from-an-oracle-database-table-in-parallel-using-python-31ecaa2c28c8) has some relevance. If you're just fetching to write data out (and you do want to use Python), then do it natively in python-oracledb without the overhead of Pandas. Some of your problem may be that you are trying to hold a lot of data (MAX_ROWS x NUM_THREADS) in Python's memory all at once and this is causing Python to crash. – Christopher Jones Mar 06 '23 at 11:31
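
Picking up the suggestions in the last few comments, here is a minimal sketch of a no-Pandas export that streams rows to a CSV with csv.writer and fetchmany(); export_query_to_csv, pool, sql, out_path and batch_size are illustrative names, not part of the original script:

import csv

def export_query_to_csv(pool, sql, out_path, batch_size=10000):
    # Acquire a pooled connection, fetch in batches and stream the rows to
    # disk as they arrive instead of building a DataFrame first.
    with pool.acquire() as connection:
        with connection.cursor() as cursor:
            cursor.arraysize = batch_size
            cursor.prefetchrows = batch_size
            cursor.execute(sql)
            columns = [col[0] for col in cursor.description]
            with open(out_path, 'w', newline='') as f:
                writer = csv.writer(f)
                writer.writerow(columns)        # header row
                while True:
                    rows = cursor.fetchmany()   # returns up to cursor.arraysize rows
                    if not rows:
                        break
                    writer.writerows(rows)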

2 Answers


Depending on the setup of your database, the query might be running in parallel (because of the hint). Without careful configuration, you might run into trouble with this if you execute many threads and the table is very large - because you could use up all your parallel servers (this depends on DB config). Perhaps that's what you're seeing.

If reading the database using multiple threads, I'd generally arrange it so that each thread reads serially (or maybe with a low degree of parallelism like 2 or 3). There's no single best choice because the performance depends on many factors, so I'd experiment. For example, it will depend on how many CPU cores you have and how fast the DB read is versus the bandwidth of the filesystem receiving the CSV files. If the write rate of your code is very high, you might find it works best with one or two threads, each running a PARALLEL 8 query in the database.

Having said that, serial is easy and it should perform consistently. Readers don't block one another in Oracle, and writers generally don't block readers either. You won't need to get into configuring parallel execution servers if you use serial queries. The other consideration is that if the 'reading query' needs to be more complex, a lower thread count and a higher PARALLEL(N) degree might deliver better performance.

As for your query - there's no need for the DML hint (ENABLE_PARALLEL_DML only applies to DML statements, and this is a plain SELECT). You probably don't need a hint at all, but if you want to control parallelism explicitly, you can use:

/*+ NO_PARALLEL */ or, if you want to try parallel execution, /*+ PARALLEL(2) */

Your use of prefetch rows and array size is good. I have tested your code and it works as I would expect with these settings (i.e. rows are being fetched in batches). I would probably choose to make the values the same so that the prefetch and subsequent fetches are all the same size. The effect of increasing the value will be more pronounced if you are connecting over a network, and even more so if there's high latency. There will come a point when increasing the values doesn't make much difference, and this point will vary depending on your infrastructure. I'd usually set it to 1000 or 10000 and call that 'good enough.'
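
For example, one way to apply that advice inside do_query (the value 10000 and the FETCH_BATCH name are just illustrative starting points to experiment with):

FETCH_BATCH = 10000                  # tune for your network and infrastructure
cursor.prefetchrows = FETCH_BATCH    # rows fetched during execute()
cursor.arraysize = FETCH_BATCH       # rows fetched on each subsequent round-trip
cursor.execute(SQL, dict(rowoffset=(tn * row_iter), maxrows=row_iter))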

It's been mentioned in an earlier answer - the OFFSET mechanism is not necessarily the most efficient way to read the data. With multiple threads, the table data is not read once: there is some overlap because the DB must visit the blocks/rows in the OFFSET even though they are not returned to the client. Partitioning the table and reading one partition per thread is usually the most efficient solution. This makes your DB design more complicated and it has licensing implications, so I'd check whether the solution you have is fast and efficient enough without having to explore that road. Oracle Exadata has some efficiency tricks which can make it unnecessary to partition, but that's a story for another day.
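
As a rough illustration only, assuming the table had been partitioned and the partition names were known to the script (the names below and the write_rows helper are hypothetical), each thread could read its own partition with Oracle's PARTITION clause and avoid OFFSET entirely:

PARTITION_NAMES = ['P1', 'P2', 'P3', 'P4']   # hypothetical partition names

def read_partition(pool, partition_name):
    # Partition names cannot be bound, so the name goes into the statement text.
    sql = f'SELECT * FROM USER_TABLE PARTITION ({partition_name})'
    with pool.acquire() as connection:
        with connection.cursor() as cursor:
            cursor.arraysize = 10000
            cursor.execute(sql)
            while True:
                rows = cursor.fetchmany()
                if not rows:
                    break
                write_rows(partition_name, rows)   # hypothetical CSV-writing helper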

When I run the script stand-alone, it doesn't exit even though it has reached "print(Make)". I'm not sure how relevant that is to you, but it looks like the connection pool needs to be closed before it will exit.
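
A small sketch of one way to do that, closing the pool from the main block once the worker threads have joined (it reaches into the private _pool attribute, so in practice you might add a close() method to TEST instead):

if __name__ == '__main__':
    result = TEST(NUM_THREADS, MAX_ROWS, username, password)
    try:
        result.do_query(NUM_THREADS)
    finally:
        result._pool.close()   # release pooled connections so the process can exit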

This is an interesting topic and there are many variables. I would aim to keep it simple with serial queries and N threads. Maybe I'd change this if I really needed to squeeze out maximum performance and efficiency, but bear in mind that if you find a sweet spot, it will move when your infrastructure or query changes. Good luck with your project.


Emil11, Panagiotis' explanation in comments is accurate. Here's what you might want to try:

  1. Partition the table by thread number (add that column to the table, or use MOD against some unique or near-unique column as a virtual column and partition by the virtual column). Unleash each thread against its own partition. This is the optimal method. You could use more than 12 threads, though check with your DBA about how many concurrent direct path reads they think is reasonable for your system.

  2. Don't use paging at all. Write data to disk by fetching a certain number of rows, writing them, fetching the next block of rows, writing, and so on: this is execute once, fetch multiple times, not execute multiple times with paging predicates in the SQL. Also, don't pull all the data into memory at once, or you have an unscalable app. If you do this, you can probably get away with not partitioning the table (though that would be optimal) as long as the thread count isn't too high. Use a MOD function to pull an even number of rows per thread (see the sketch after this list).

  3. Recent versions of Oracle should use serial direct path reads, but that doesn't always happen. Have your DBA examine the waits for your activity and see whether they are "direct path read" or "scattered read". If you are getting scattered reads, you are going through the buffer cache, and that's inefficient for data dumping like this. Try adding a small degree of parallelism ( SELECT /*+ parallel(2) */ ... ) to encourage direct path reads.

  4. Make sure your app fetches in blocks of rows (like 10,000 or 100,000 rows at a time), not all rows at once and not 1 row at a time. All rows will blow out your memory; 1 row at a time adds an enormous amount of network round-trip inefficiency. There's a happy medium.
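
A rough sketch pulling points 2-4 together. The CUSTOMER_ID column, the MOD-based split and the dump_rows name are illustrative assumptions, and the batch size and PARALLEL(2) hint are things to experiment with rather than fixed recommendations:

import csv

# One query per thread: each thread selects its own share of the rows via a
# MOD split on a near-unique numeric column, then fetches and writes in
# batches instead of paging with OFFSET or holding everything in memory.
SPLIT_SQL = ('SELECT /*+ PARALLEL(2) */ * FROM USER_TABLE '
             'WHERE MOD(CUSTOMER_ID, :nthreads) = :tn')

def dump_rows(pool, tn, nthreads, batch_size=10000):
    with pool.acquire() as connection:
        with connection.cursor() as cursor:
            cursor.arraysize = batch_size
            cursor.prefetchrows = batch_size
            cursor.execute(SPLIT_SQL, dict(nthreads=nthreads, tn=tn))
            with open(f'TH_{tn}_customer.csv', 'w', newline='') as f:
                writer = csv.writer(f)
                writer.writerow([col[0] for col in cursor.description])
                while True:
                    rows = cursor.fetchmany()   # up to cursor.arraysize rows
                    if not rows:
                        break
                    writer.writerows(rows)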

Paul W