
I am trying to fetch data from a SQL Server database (just a simple SELECT * query).

The table contains around 3-5 million records. Performing a SELECT * directly on SQL Server using SSMS takes around 11-15 minutes.

However, when I connect via Python and try to save the data into a pandas DataFrame, it takes forever: more than 1 hour.

Here is the code I am using:

import pymssql
import pandas as pd
from datetime import datetime

startTime = datetime.now()

## instantiate a Python DB connection object - same interface as the psycopg2/python-mysql drivers
conn = pymssql.connect(server=r"xyz", database="abc", user="user", password="pwd")
print('Connecting to DB: ', datetime.now() - startTime)

stmt = "SELECT * FROM BIG_TABLE;"
# Execute the query here
df_big_table = pd.read_sql(stmt, conn)

There must be a better way to do this? Perhaps parallel processing or something similar to fetch the data quickly.
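For example, this is roughly what I had in mind for parallel fetching (an untested sketch; it assumes the table has an indexed integer key column, which I am calling ID here):

import pymssql
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def fetch_range(lo, hi):
    # each worker opens its own connection - pymssql connections should not be shared across threads
    conn = pymssql.connect(server=r"xyz", database="abc", user="user", password="pwd")
    df = pd.read_sql(f"SELECT * FROM BIG_TABLE WHERE ID BETWEEN {lo} AND {hi};", conn)
    conn.close()
    return df

# find the key range once, then split it into one slice per worker
conn = pymssql.connect(server=r"xyz", database="abc", user="user", password="pwd")
lo, hi = pd.read_sql("SELECT MIN(ID) AS lo, MAX(ID) AS hi FROM BIG_TABLE;", conn).iloc[0]
conn.close()

n_workers = 4
step = (hi - lo) // n_workers + 1
ranges = [(lo + i * step, min(lo + (i + 1) * step - 1, hi)) for i in range(n_workers)]

with ThreadPoolExecutor(max_workers=n_workers) as pool:
    parts = list(pool.map(lambda r: fetch_range(*r), ranges))

df_big_table = pd.concat(parts, ignore_index=True)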

My end goal is to migrate this table from SQL Server to Postgres.

This is the way I am doing it:

  1. Fetch data from SQL Server using Python.
  2. Save it to a pandas DataFrame.
  3. Save this data as CSV to disk.
  4. Copy the CSV from disk to Postgres.

Probably I can combine steps 3 and 4, so that the transfer happens in memory rather than going through disk I/O.
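Something like this is what I have in mind for the in-memory version (a rough sketch; it assumes psycopg2 on the Postgres side and that a target table big_table with a matching column layout already exists):

import io
import psycopg2

def copy_df_to_postgres(df, pg_conn, table_name):
    # render the DataFrame as CSV into an in-memory buffer instead of a file on disk
    buf = io.StringIO()
    df.to_csv(buf, index=False, header=False)
    buf.seek(0)
    with pg_conn.cursor() as cur:
        # COPY ... FROM STDIN streams the buffer straight into Postgres
        cur.copy_expert(f"COPY {table_name} FROM STDIN WITH (FORMAT csv)", buf)
    pg_conn.commit()

pg_conn = psycopg2.connect(host="pg_host", dbname="abc", user="user", password="pwd")
copy_df_to_postgres(df_big_table, pg_conn, "big_table")
pg_conn.close()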

There are many complexities like table constraints, definitions, etc., which I will take care of later on. I cannot use a third-party tool.

I am stuck at steps 1 and 2, so help with the Python script (or some other open-source language) would be really appreciated.

If there is any other way to reach my end goal, I welcome suggestions!

Anant Vaibhav
  • Have you seen this [post](https://stackoverflow.com/questions/1370326/migrate-data-from-ms-sql-to-postgresql)? – Balaji Ambresh Jun 24 '20 at 18:19
  • @BalajiAmbresh, thanks, I stumbled upon this link, but as I said I cannot use a licensed tool for now. I have to try whether it's possible with open source first. It should definitely be possible with some manual workarounds to begin with. – Anant Vaibhav Jun 24 '20 at 18:25
  • 1
    If that's the case, why don't you export data from ms sql to csv and import the csv into postgres? – Balaji Ambresh Jun 24 '20 at 18:47
  • The process has to be scheduled, hence I am writing a Python script to automate it. – Anant Vaibhav Jun 24 '20 at 19:21
  • If that's the case, invoking DB commands from a Python script is a better solution than using pandas (see the sketch just below). – Balaji Ambresh Jun 24 '20 at 19:24
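A minimal sketch of that "DB commands from Python" approach, assuming the SQL Server bcp utility and the Postgres psql client are installed and on PATH (server names, credentials, and table names are placeholders); note that bcp -c does not quote embedded delimiters, so a different field terminator may be needed for messy text columns:

import subprocess

# export straight from SQL Server to a flat file with bcp
subprocess.run([
    "bcp", "SELECT * FROM BIG_TABLE", "queryout", "big_table.csv",
    "-c", "-t,", "-S", "xyz", "-d", "abc", "-U", "user", "-P", "pwd",
], check=True)

# load the file into Postgres with psql's \copy
# (password supplied via PGPASSWORD or ~/.pgpass)
subprocess.run([
    "psql", "-h", "pg_host", "-d", "abc", "-U", "user",
    "-c", r"\copy big_table FROM 'big_table.csv' WITH (FORMAT csv)",
], check=True)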

1 Answer


Have you tried the chunksize option of pandas.read_sql? You can still get all of the data into a single DataFrame (or CSV) by concatenating the chunks.
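A minimal sketch, reusing the conn from the question (the chunk size is a placeholder you would tune to your memory budget):

import pandas as pd

chunks = pd.read_sql("SELECT * FROM BIG_TABLE;", conn, chunksize=100_000)

# append each chunk to one CSV, writing the header only once
with open("big_table.csv", "w", newline="") as f:
    for i, chunk in enumerate(chunks):
        chunk.to_csv(f, index=False, header=(i == 0))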

If that takes too much time or memory, you can split the result across multiple files by using pandas.read_sql as an iterator, and after you have done your work, consolidate those files into a single one and submit it to Postgres.
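Roughly like this (again a sketch; file names and the chunk size are placeholders):

import glob
import pandas as pd

# one file per chunk, so a failed run can resume without refetching everything
for i, chunk in enumerate(pd.read_sql("SELECT * FROM BIG_TABLE;", conn, chunksize=100_000)):
    chunk.to_csv(f"big_table_part_{i:05d}.csv", index=False)

# consolidate: keep the header from the first file only, then append the data rows
with open("big_table.csv", "w") as out:
    for j, path in enumerate(sorted(glob.glob("big_table_part_*.csv"))):
        with open(path) as part:
            header = part.readline()
            if j == 0:
                out.write(header)
            out.write(part.read())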

jottbe
Betjens