
I have an Oracle server from which I need to extract data into files using Python. These files are used as inputs by downstream systems.

A few technical details: Oracle and Python run on different servers. The database is hosted by the client, while all the scripts run on an AWS RHEL EC2 instance. (Details of the EC2 instance were given in a screenshot, not reproduced here.)

My approach: To do this I chose Python's pyodbc library to connect to the remote Oracle database and extract the data with a SQL query. Below is an extract of the code that fetches the data for the SQL query provided.

# Requires module-level: import csv, time
def fetch_data_to_file(self, curr, query, dataset):
    try:
        self.logger.info('Executing query: {}'.format(query))
        start = time.time()
        curr.execute(query)
        query_time = time.time() - start

        start = time.time()
        rowcount = 0
        with open(dataset, 'a+', newline='') as f:
            writer = csv.writer(f, delimiter='|')
            # Header row taken from the cursor metadata
            writer.writerow([col[0] for col in curr.description])
            self.logger.info('Writing file: {}'.format(dataset))
            while True:
                rows = curr.fetchmany(self.batch_limit)
                if not rows:
                    break
                writer.writerows(rows)
                rowcount += len(rows)
                self.logger.info('Wrote {} rows'.format(len(rows)))

        elapsed = time.time() - start
        self.timer.info('{} Query Execution Time: {} seconds'.format(dataset, query_time))
        self.timer.info('{} File Writing Time: {} seconds, at {} rows/second'.format(
            dataset, elapsed, int(rowcount / elapsed)))
        self.logger.info('File written.')
    except Exception as e:
        self.error.info('Error in fetching data. Error: {}'.format(e))
        raise SystemExit(1)

The dataset I am extracting is close to 8 GB uncompressed (roughly 35 million rows), and the code takes ~1.5 hours to download the file onto my EC2 server. I tested multiple values of batch_limit and found 1-2 million rows to be the optimal batch size, but I am unsure whether there is a more systematic way to figure out what my batch size should be.
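A minimal sketch of how candidate batch sizes could be compared by timing a row-capped sample of the query rather than the full 35 million rows (the DSN, credentials, and sample query below are placeholders, not the real ones):

import time

import pyodbc  # assumes an Oracle ODBC driver/DSN is already configured

def benchmark_batch_sizes(conn_str, sample_query, candidates=(10_000, 100_000, 1_000_000)):
    """Time fetchmany() over a row-capped sample query for each candidate batch size."""
    timings = {}
    for batch_size in candidates:
        conn = pyodbc.connect(conn_str)
        curr = conn.cursor()
        curr.execute(sample_query)   # e.g. a query ending in 'FETCH FIRST 1000000 ROWS ONLY'
        start = time.time()
        while True:
            rows = curr.fetchmany(batch_size)
            if not rows:
                break
        timings[batch_size] = time.time() - start
        conn.close()
    return timings

# Placeholder usage:
# print(benchmark_batch_sizes('DSN=oracle_dsn;UID=user;PWD=secret',
#                             'SELECT * FROM big_table FETCH FIRST 1000000 ROWS ONLY'))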

What else I have looked into: I searched online for ways to write large datasets to files with Python, and many suggestions pointed to Pandas. I tried to make that work but failed. It is also important that I preserve the data and its datatypes while extracting them to files.
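For reference, a rough sketch of what the chunked Pandas approach would look like; it streams the result in pieces instead of loading all 8 GB at once (the connection string, query, and chunk size are placeholders):

import pandas as pd
import pyodbc  # assumes an Oracle ODBC driver/DSN is already configured

conn = pyodbc.connect('DSN=oracle_dsn;UID=user;PWD=secret')  # placeholder connection string
query = 'SELECT * FROM big_table'                            # placeholder query

first_chunk = True
# chunksize makes read_sql return an iterator of DataFrames instead of one big frame
for chunk in pd.read_sql(query, conn, chunksize=100_000):
    chunk.to_csv('dataset.csv', sep='|', index=False,
                 header=first_chunk, mode='w' if first_chunk else 'a')
    first_chunk = False

conn.close()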

My questions are: Is there anything I can do to make this code more efficient? Is Python the best-suited language for this? (Please note that I need to be able to automate the jobs in whatever language I choose. Licensed libraries are difficult to adopt at this point due to internal pricing challenges at my firm.)

Also, in case it helps, I captured a snapshot of memory usage (htop) while the code was downloading data; the screenshot is not reproduced here.

Rhythem Aggarwal
  • Pandas has a limit of 200MB, so it's not going to help. What is the final destination of the file? Are these "downstream systems" on the EC2 server? Are these systems Oracle, other RDBMS, or something else? Are the extract files CSV or some other format? – APC May 15 '19 at 11:35
  • Finally the files are fed to multiple systems including database(Redshift) and reporting solutions(Microstrategy). And 2 files are posted on FTP for other vendors to consume as per their requirement. – Rhythem Aggarwal May 15 '19 at 13:31
  • The extracts need to be PIPE delimited specifically due to the nature of how other systems consuming the files are setup. I can consider having a different delimiter but only if it is absolutely necessary because that will involve convincing multiple teams to change their modules. – Rhythem Aggarwal May 15 '19 at 13:32
  • So does it really matter whether the files are written to the EC2 server? Because it could be more efficient to write them local to the database and distribute them from there. The actual delimiter doesn't matter. Just establishing that it was a delimited text file rather than (say) XML, JSON or something proprietary (Excel, SQL insert statements). – APC May 15 '19 at 13:33
  • @APC I might have understood you incorrectly. Are you suggesting I write the data directly to a table instead of files? Something like Oracle -> Redshift? – Rhythem Aggarwal May 15 '19 at 13:34
  • I'm not suggesting anything at the moment. I don't know nearly enough about your system architecture to make suggestions. But we know network I/O is a bottleneck when shunting data from server to server. So it might be more efficient to export the data to a file local to the database (say using PL/SQL or SQL*Plus capability), zip it there and ship the zip to the downstream systems. But, depending on your downstream systems it might not. – APC May 15 '19 at 13:40
  • Oh, I get it now. Exporting data to the server Oracle is running on might be difficult at the moment, since that would require conversations between the vendor and their IT team. The Oracle database is managed by a team outside my firm. – Rhythem Aggarwal May 15 '19 at 14:31
  • Where do you tune [`cursor.arraysize`](https://cx-oracle.readthedocs.io/en/latest/cursor.html#Cursor.arraysize)? This will have a big impact on cx_Oracle fetch performance, particularly if Python is on a different machine to the DB. – Christopher Jones May 23 '19 at 02:18
  • I have a variable in Python which is set to 100000 at the moment, and I am using it as the array size in the fetchmany call. – Rhythem Aggarwal May 27 '19 at 05:09
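For context, a minimal sketch of where cursor.arraysize is set when using cx_Oracle directly, as discussed in the last two comments (credentials, DSN, and query are placeholders):

import cx_Oracle  # Oracle's own Python driver

conn = cx_Oracle.connect('user', 'secret', 'hostname:1521/ORACLE_SID')  # placeholder credentials
curr = conn.cursor()
curr.arraysize = 100_000                  # rows fetched per network round trip
curr.execute('SELECT * FROM big_table')   # placeholder query

while True:
    rows = curr.fetchmany()               # defaults to arraysize rows per call
    if not rows:
        break
    # ... write rows ...

conn.close()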

3 Answers


One possibility would be to download the free SQLcl utility: basically a Java-based SQL*Plus that does much more (it's a free download from Oracle). You can drop SQLcl on the client machine and use it to extract the data, and it takes care of the delimiter for you. Here I'm setting the delimiter to the pipe symbol. This may be more efficient than doing it via Python, and you'd still be able to script it and call it from Python, or wherever.

$ sqlcl username/password@'<hostname>:<port>/ORACLE_SID'
> Set sqlformat delimited |
> Spool <some file to hold the data>
> Select * from <some table>;
> Spool off

And the above can easily be dropped into a shell script.

#!/bin/bash

sqlcl username/password@'<hostname>:<port>/ORACLE_SID' <<EOF
Set sqlformat delimited |
Spool <some file to hold the data>
Select * from <some table>;
Spool off
EOF
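If the job still needs to be driven from Python, a minimal sketch of calling such a script with subprocess (the script path is a placeholder, not part of SQLcl):

import subprocess

# Run the shell script above and fail loudly if SQLcl exits with an error.
# '/path/to/extract.sh' is a placeholder for wherever the script is saved.
result = subprocess.run(['/bin/bash', '/path/to/extract.sh'],
                        capture_output=True, text=True)
if result.returncode != 0:
    raise SystemExit('SQLcl extract failed: {}'.format(result.stderr))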

Example

sqlcl> Select * from emp;
"EMPNO"|"ENAME"|"JOB"|"MGR"|"HIREDATE"|"SAL"|"COMM"|"DEPTNO"
7839|"KING"|"PRESIDENT"||17-NOV-81|5000||10
7698|"BLAKE"|"MANAGER"|7839|01-MAY-81|2850||30
7782|"CLARK"|"MANAGER"|7839|09-JUN-81|2450||10
7566|"JONES"|"MANAGER"|7839|02-APR-81|2975||20
7788|"SCOTT"|"ANALYST"|7566|09-DEC-82|3000||20
7902|"FORD"|"ANALYST"|7566|03-DEC-81|3000||20
7369|"SMITH"|"CLERK"|7902|17-DEC-80|800||20
7499|"ALLEN"|"SALESMAN"|7698|20-FEB-81|1600|300|30
7521|"WARD"|"SALESMAN"|7698|22-FEB-81|1250|500|30
7654|"MARTIN"|"SALESMAN"|7698|28-SEP-81|1250|1400|30
7844|"TURNER"|"SALESMAN"|7698|08-SEP-81|1500|0|30
7876|"ADAMS"|"CLERK"|7788|12-JAN-83|1100||20
7900|"JAMES"|"CLERK"|7698|03-DEC-81|950||30
7934|"MILLER"|"CLERK"|7782|23-JAN-82|1300||10
Adam vonNieda
  • This looks promising. One question though: does this need to be installed on the server running Oracle? – Rhythem Aggarwal May 16 '19 at 04:38
  • No, I understand your situation, you'd install it (it's portable) on the client side, where your Python is running. It's just an unzip. – Adam vonNieda May 16 '19 at 12:24
  • Faster still is simply to use SQL*Plus's 'SET MARKUP CSV ON' mode, see https://blogs.oracle.com/opal/fast-generation-of-csv-and-json-from-oracle-database – Christopher Jones May 23 '19 at 01:56

A very fast solution for dumping data in CSV format is SQL*Plus 12.2's CSV mode. If you don't already have it, you can get SQL*Plus for free from the Instant Client packages.

Create a SQL script ex.sql like:

set feedback off
set arraysize 500
select * from locations;
exit

You can/should tune ARRAYSIZE for optimum performance.

Then invoke SQL*Plus with the -m 'csv on' option. This uses a new, fast I/O subsystem:

sqlplus -l -s -m 'csv on delim |' cj@'"localhost/orclpdb1"' @ex.sql

Note that the -s (silent) option means the password prompt will not be visible.

The output will be like:

"LOCATION_ID"|"STREET_ADDRESS"|"POSTAL_CODE"|"CITY"|"STATE_PROVINCE"|"COUNTRY_ID"
1000|"1297 Via Cola di Rie"|"00989"|"Roma"||"IT"
. . .

My release announcement blog post has some more details.
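If the extract has to be kicked off from the existing Python job, one option is to redirect sqlplus's stdout to the target file; a sketch along those lines (credentials, connect string, and file names are placeholders):

import subprocess

# Placeholder credentials and connect string; passing user/password avoids an
# interactive password prompt when the output is redirected to a file.
cmd = ['sqlplus', '-l', '-s', '-m', 'csv on delim |',
       'user/secret@"localhost/orclpdb1"', '@ex.sql']

# sqlplus writes the delimited rows to stdout, so capture stdout into the extract file.
with open('locations.csv', 'w') as out:
    subprocess.run(cmd, stdout=out, check=True)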

Christopher Jones

Just a thought: create a multithreaded Python method that breaks your dataset into parts. Depending on your target database, these files can then be read as external tables, eliminating one more step if you want to push the data into a table.
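A rough sketch of that idea, assuming there is a numeric key column to partition on (the table name, key column, connection string, and part count are all placeholders):

import concurrent.futures
import csv

import pyodbc  # assumes an Oracle ODBC driver/DSN is already configured

CONN_STR = 'DSN=oracle_dsn;UID=user;PWD=secret'   # placeholder connection string
NUM_PARTS = 4

def dump_partition(part):
    """Fetch one slice of the table (by MOD on a numeric key) into its own file."""
    conn = pyodbc.connect(CONN_STR)
    curr = conn.cursor()
    # 'big_table' and 'id' are placeholders for the real table and key column
    curr.execute('SELECT * FROM big_table WHERE MOD(id, ?) = ?', NUM_PARTS, part)
    with open('dataset_part{}.csv'.format(part), 'w', newline='') as f:
        writer = csv.writer(f, delimiter='|')
        writer.writerow([col[0] for col in curr.description])
        while True:
            rows = curr.fetchmany(100_000)
            if not rows:
                break
            writer.writerows(rows)
    conn.close()

with concurrent.futures.ThreadPoolExecutor(max_workers=NUM_PARTS) as pool:
    list(pool.map(dump_partition, range(NUM_PARTS)))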

Paul C
  • Welcome to SO! Please read the [tour](https://stackoverflow.com/tour), and [How do I write a good answer?](https://stackoverflow.com/help/how-to-answer) – Tomer Shetah Oct 11 '20 at 05:40