
I am reading data from SQL Server and writing it into a .dat file using pandas dataframes.

The total number of records is 31M+, or 4.87 GB of data. It's taking 11.40 sec to read and write into the .dat file. The current VB system completes in 6 min.

Is there any way I can improve the performance of my script?

import pandas as pd
import pyodbc

def test_SQLPandas():
    # stream the query in 50k-row chunks and append each chunk to the .dat file
    conn = pyodbc.connect(conn_str)
    for i, data in enumerate(pd.read_sql(sql, conn, chunksize=50000)):
        data.to_csv('sample.dat', sep='\t', mode='a', header=(i == 0), index=False)
  • What if you skipped pandas and read the data using regular Python, then transferred it with plain old csv? Have you tried that to notice any difference? My guess is that when you read the data from your SQL Server into vanilla Python, you can get it as a generator and then easily pull the data into your target file (see the sketch after these comments). – sammywemmy Apr 08 '20 at 20:47
  • If you're just reading and writing, then maybe use the `chunksize` parameter in pandas; if not, use a `dask` dataframe. – Umar.H Apr 08 '20 at 20:49
  • Does this answer your question? [How to read a 6 GB csv file with pandas](https://stackoverflow.com/questions/25962114/how-to-read-a-6-gb-csv-file-with-pandas) – Umar.H Apr 08 '20 at 20:49
  • I haven't tried plain Python, but I will give it a shot. I am OK with using anything as long as I can see a difference in performance. @sammywemmy – RAMA KRISHNA Apr 09 '20 at 21:19
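
Regarding the first comment (skipping pandas entirely), a minimal sketch of that idea might look like the following, assuming the same `conn_str`, `sql` and output file name as in the question; the pyodbc cursor acts as the lazy iterator and `csv.writer` streams rows to disk in batches:

import csv
import pyodbc

def dump_to_dat():
    conn = pyodbc.connect(conn_str)
    cursor = conn.cursor()
    cursor.execute(sql)

    with open('sample.dat', 'w', newline='') as f:
        writer = csv.writer(f, delimiter='\t')
        writer.writerow([col[0] for col in cursor.description])  # header row
        while True:
            rows = cursor.fetchmany(50000)  # pull rows in batches, never all 31M at once
            if not rows:
                break
            writer.writerows(rows)

    cursor.close()
    conn.close()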

1 Answer


I believe SQL Server is multithreaded and can respond to multiple simultaneous requests, so I think you should take advantage of multiprocessing:

import multiprocessing as mp

import pandas as pd
import pyodbc

use_threads = mp.cpu_count() - 1
batch_size = 50000

# note: OFFSET/FETCH requires an ORDER BY clause in SQL Server
query = '''your query...
           OFFSET {} ROWS
           FETCH NEXT {} ROWS ONLY;'''

conn = pyodbc.connect(conn_str)
tot_rows = pd.read_sql("select count(*) as tot_rows from your_table", conn).iloc[0, 0]
conn.close()

def read_sql(batch):
    # each worker opens its own connection: a pyodbc connection cannot be shared across processes
    worker_conn = pyodbc.connect(conn_str)
    df = pd.read_sql(query.format(batch * batch_size, batch_size), worker_conn)
    worker_conn.close()
    return df

with mp.Pool(use_threads) as pool:
    dfs = pool.map(read_sql, range(tot_rows // batch_size + 1))

The idea is to batch the query using OFFSET/FETCH in SQL Server and have multiple processes on your machine request these batches simultaneously.
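
For illustration, with the placeholder query above and batch_size = 50000, the statement a worker would send for batch 2 looks like this (purely a sketch; the real query also needs its ORDER BY clause):

# illustrative only: the SQL produced for batch 2 of the placeholder query above
batch = 2
print(query.format(batch * batch_size, batch_size))
# your query...
#            OFFSET 100000 ROWS
#            FETCH NEXT 50000 ROWS ONLY;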

You could then concatenate the batches into one table and write the CSV:

pd.concat(dfs).to_csv("sample.csv")

Or exploit multiprocessing again:

def to_csv(args):
    # unpack the (dataframe, filename) pair and write it out
    df, name = args
    df.to_csv(name)

with mp.Pool(use_threads) as pool:
    pool.map(to_csv, [(df, f'sample_{i}') for i, df in enumerate(dfs)])
CircArgs