I have an Oracle server from which I need to extract data into files using Python. These files are used as inputs by downstream systems.
A few technical details:
Oracle and Python run on different servers. The database is client-hosted, while all scripts run on an AWS RHEL EC2 instance. Details of the EC2 instance are in the attached screenshot.
My approach
To do this I chose Python's pyodbc library to connect to the remote Oracle database and extract data with a SQL query. Below is an extract of the code that fetches the data for a given query.
import csv
import time

def fetch_data_to_file(self, curr, query, dataset):
    try:
        self.logger.info('Executing query: {}'.format(query))
        start = time.time()
        curr.execute(query)
        query_time = time.time() - start

        start = time.time()
        rowcount = 0
        with open(dataset, 'a+', newline='') as f:
            writer = csv.writer(f, delimiter='|')
            # Header row taken from the cursor metadata
            writer.writerow([i[0] for i in curr.description])
            self.logger.info('Writing file: {}'.format(dataset))
            while True:
                rows = curr.fetchmany(self.batch_limit)
                if not rows:
                    break
                writer.writerows(rows)  # write the current batch to disk
                rowcount += len(rows)
                self.logger.info('Wrote {} rows'.format(len(rows)))
        write_time = time.time() - start
        self.timer.info('{} Query Execution Time: {} seconds'.format(dataset, query_time))
        self.timer.info('{} File Writing Time: {} seconds, at {} rows/second'.format(
            dataset, write_time, int(rowcount / write_time)))
        self.logger.info('File written.')
    except Exception as e:
        self.error.info('Error in fetching data. Error: {}'.format(e))
        raise SystemExit(1)
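For context, the curr cursor passed into this method comes from a plain pyodbc connection, along these lines (the DSN name and credentials are placeholders for whatever ODBC driver configuration is actually set up on the EC2 host):

import pyodbc

# Placeholder DSN and credentials; the real connection string depends on the
# Oracle ODBC driver configured on the EC2 host.
conn = pyodbc.connect('DSN=ORACLE_DSN;UID=username;PWD=password')
curr = conn.cursor()
# curr is then handed to fetch_data_to_file(curr, query, dataset)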
The dataset I am extracting is close to 8 GB uncompressed (about 35 million rows returned), and the code takes ~1.5 hours to download the file onto my EC2 server. I tested multiple values of batch_limit and found 1-2 million rows to be the optimal batch size, but I am not sure whether there is a more systematic way to figure out what the batch size should be.
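For reference, a minimal sketch of how that comparison can be scripted; the function name, candidate sizes, and the sample-row cap are all illustrative, and it assumes re-running the query once per candidate is acceptable:

import time

def benchmark_batch_sizes(curr, query,
                          candidates=(100000, 500000, 1000000, 2000000),
                          sample_rows=5000000):
    # Time fetching a fixed sample of rows at each candidate batch size,
    # re-executing the query so every run starts from a fresh cursor.
    for batch in candidates:
        curr.execute(query)
        fetched = 0
        start = time.time()
        while fetched < sample_rows:
            rows = curr.fetchmany(batch)
            if not rows:
                break
            fetched += len(rows)
        elapsed = time.time() - start
        print('batch={}: {:.0f} rows/second'.format(batch, fetched / elapsed))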
What else I have looked into
I looked online for ways to write large datasets to files using Python, and many suggestions pointed to Pandas. I tried to make that work but could not (a sketch of what I was attempting follows). Also, it is important that I preserve the data and its datatypes while extracting to files.
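This is roughly the chunked Pandas variant I was trying to get to. The function name and chunksize default are illustrative; note that read_sql officially prefers an SQLAlchemy connectable over a raw pyodbc connection, and that CSV stores everything as text, so datatypes ultimately depend on how downstream systems parse the file:

import pandas as pd

def fetch_with_pandas(conn, query, dataset, chunksize=1000000):
    # read_sql with chunksize returns an iterator of DataFrames, so the
    # full 8 GB result never has to sit in memory at once.
    first = True
    for chunk in pd.read_sql(query, conn, chunksize=chunksize):
        # Append each chunk; only the first one carries the header row.
        chunk.to_csv(dataset, sep='|', mode='a', header=first, index=False)
        first = False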
My ask here is: is there anything I can do to make this code more efficient? Is Python the best-suited language for this? (Please note, I need to be able to automate the jobs in whatever language I choose. Licensed libraries are difficult at this point due to internal pricing challenges at my firm.)
Also, not sure if this helps, but here is a snapshot of my memory usage (htop) while the code was downloading data: