
I need to load a table from SQL Server. This is the first time I'm doing this, so I'm uncertain and lack expertise in this area.

I need the data of the whole table, which has 64 million rows. To load it into a pandas DataFrame I tried the code below, and I also tried SQLAlchemy and some chunksize examples, which did not work at all. Maybe dask is more efficient than pandas, but I have never used dask before.
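
I have not actually tried dask yet, but from its documentation I think the attempt would look roughly like this (a rough, untested sketch; the SQLAlchemy-style connection URI, the index column Id, and the schema are guesses on my part):

import dask.dataframe as dd

# Untested sketch: dask partitions the table on an indexed column,
# so it never has to hold all 64 million rows in memory at once.
uri = 'mssql+pyodbc://SERVER%5CDEV/Clusteranalysis?driver=SQL+Server&trusted_connection=yes'
ddf = dd.read_sql_table('Data', uri, index_col='Id', schema='dbo')
ddf = ddf[ddf.Rc == 0]   # keep the Rc=0 filter lazy
print(ddf.head())        # head() only computes the first partition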

The pyodbc code below works up to TOP 1000000, which takes 7.5 minutes; larger queries, however, do not finish successfully.

Maybe someone can help me figure out the best practice for doing this.

import pandas as pd
import pyodbc

conn = pyodbc.connect('Driver={SQL Server};'
                      r'Server=SERVER\DEV;'   # raw string so the backslash in the instance name stays literal
                      'Database=Clusteranalysis;'
                      'Trusted_Connection=yes;')
cursor = conn.cursor()
query = 'SELECT * FROM [Clusteranalysis].[dbo].[Data] WHERE Rc=0'
result_port_map = pd.read_sql(query, conn)
print(result_port_map.head(5))
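
The chunksize variant I tried looked roughly like this (reconstructed from memory; the chunk size of 500,000 is arbitrary):

# Read the query in chunks of 500,000 rows instead of all at once.
chunks = []
for chunk in pd.read_sql(query, conn, chunksize=500000):
    chunks.append(chunk)
result_port_map = pd.concat(chunks, ignore_index=True)

Even in chunks, concatenating everything at the end still needs all 64 million rows in memory, which I suspect is part of the problem.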

Thank you for your help!

Mike_H
  • I think you need to rethink what you're doing; see: https://stackoverflow.com/questions/14262433/large-data-work-flows-using-pandas. Basically, if the df becomes unwieldy to store in memory, then you need to use fast out-of-memory storage – EdChum Apr 10 '19 at 13:32
  • I also read this discussion, but couldn't figure out how to apply it to my problem. :/ – Mike_H Apr 10 '19 at 13:34
  • Well, the first question is: do you need to load all the data? You could export the data to HDF or use dask. The other thing is: do you need to use pandas? You already have the data in your database, so you could just run your queries on that; if you want to do some analysis that is too difficult or not possible in the DB, then you could load the data into HDF – EdChum Apr 10 '19 at 13:36
  • Depending on what you actually want to calculate, you should try to do as much as possible within the DBMS (e.g. pre-aggregating) to reduce the data before unloading it to Pandas – dnoeth Apr 10 '19 at 13:36
  • @EdChum Yes, I need to load everything; that's the unfortunate part. Okay, I have no experience using HDF or dask. So pandas is the wrong tool to use here when I'm not able to pre-aggregate my data before unloading it? – Mike_H Apr 10 '19 at 13:37
  • Pandas works fine with HDF; if you read the first answer, it shows how to load into HDF. You could do this in chunks and then just run queries against HDF – EdChum Apr 10 '19 at 13:38
  • I'll try to implement this once again and come back with an answer of my own. :) Maybe it will be much easier to help me then. – Mike_H Apr 10 '19 at 13:42
  • @EdChum I really struggle to set up the HDF the right way. What is the mapping needed for, and can I directly pass my SQL query? I'm way too confused about all of this. :/ – Mike_H Apr 10 '19 at 13:49
  • You just read N rows from your DB and then store them into an HDFStore, which is just a file (see the sketch after these comments) – EdChum Apr 10 '19 at 13:51
  • https://stackoverflow.com/questions/45200903/import-huge-data-set-from-sql-server-to-hdf5 This appears to be close to my issue, but it still has bugs. Maybe I need to stop working on this and start over tomorrow morning. – Mike_H Apr 10 '19 at 13:54
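
Below is a minimal sketch of the chunk-by-chunk read into an HDFStore that the comments describe (untested; it needs the PyTables package installed, and the chunk size, the file name, and the example column in the final query are placeholders):

import pandas as pd
import pyodbc

conn = pyodbc.connect('Driver={SQL Server};'
                      r'Server=SERVER\DEV;'
                      'Database=Clusteranalysis;'
                      'Trusted_Connection=yes;')
query = 'SELECT * FROM [Clusteranalysis].[dbo].[Data] WHERE Rc=0'

# Stream the query in chunks and append each chunk to an on-disk HDF5 file,
# so the full 64 million rows never have to sit in memory at once.
with pd.HDFStore('clusteranalysis.h5', mode='w') as store:
    for chunk in pd.read_sql(query, conn, chunksize=500000):
        # data_columns=True makes the columns queryable later;
        # string columns may need min_itemsize if their lengths vary across chunks.
        store.append('data', chunk, data_columns=True)

# Afterwards, queries can run against the HDF file instead of the database, e.g.:
# subset = pd.read_hdf('clusteranalysis.h5', 'data', where='SomeColumn == 1')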
