Setup: I have a pre-processed dataset on an MS SQL Server with about 500,000,000 rows and 20 columns, one of which is a rather long text column (varchar(1300)); the table takes up about 35 GB of data space in the SQL database. I'm working on the physical machine where the MS SQL Server is running, so no network traffic is involved, and the machine has 128 GB of RAM, of which MS SQL Server is allowed to use at most 40 GB. I want to import the dataset into Python for further processing. Assume some deep learning experimentation; this matters because I need to be able to transfer the text column as is.
Anecdote: For testing the import code, I used a small subsample of about 700,000 rows. This takes about 1 minute to run, Python climbs to about 700 MB of RAM usage, and saving the resulting variable to the filesystem produces a file of about 250 MB. Extrapolating (the full dataset is roughly 700 times larger), importing everything should take about 700 minutes and produce a file of about 175 GB. That is quite a lot, especially compared to, say, copying the full 31 GB table within SQL, which takes a few minutes at most. I let the full import run for a day to see what would happen, to no avail.
Alternatives: I tried skipping pandas and sqlalchemy and using pyodbc directly, which led me to believe that the problem lies in how pyodbc handles the data import: it stores the queried data in a rows object, which I only managed to read row by row in a loop, and that seems very inefficient to me (a sketch of that attempt follows this paragraph). I don't know whether pandas and sqlalchemy handle this differently under the hood. I also tried not importing the full dataset with a single select statement but splitting it up into lots of smaller ones, which made the small test dataset take 30 minutes to load instead of 1 minute (a sketch of that variant is at the bottom of the post).
Question: How do I load this large (but, compared to other databases, not-so-large) dataset into Python at all? And there has to be a way to do it efficiently: it should not take significantly longer than copying the full table within SQL, and it should not take significantly more space than the table occupies in the SQL database. I do not understand why the data size blows up so much during the process. The solution should not require extracting the table to any medium other than Python first (i.e. no .csv files or the like), though the use of other Python packages is fine.
Code: the pandas + sqlalchemy version I am currently using:

import pandas as pd
import sqlalchemy


def load_data():
    # pull the whole table into a single DataFrame and sort it
    query = "select * from data.table"
    engine = sqlalchemy.create_engine(
        'mssql+pyodbc://server/database'
        '?driver=SQL+Server+Native+Client+11.0'
        '&trusted_connection=yes'
    )
    dat = pd.read_sql(query, engine)
    dat = dat.sort_values(['id', 'date'])
    return dat
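For completeness, the "splitting it up into lots of smaller selects" attempt mentioned under Alternatives looked roughly like this; the pagination scheme and the batch size are illustrative, not my exact code:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine(
    'mssql+pyodbc://server/database'
    '?driver=SQL+Server+Native+Client+11.0&trusted_connection=yes'
)

chunk_size = 100_000  # illustrative batch size
offset = 0
chunks = []
while True:
    # SQL Server pagination: ORDER BY is required for OFFSET ... FETCH
    query = (
        "select * from data.table order by id "
        f"offset {offset} rows fetch next {chunk_size} rows only"
    )
    chunk = pd.read_sql(query, engine)
    if chunk.empty:
        break
    chunks.append(chunk)
    offset += chunk_size

dat = pd.concat(chunks, ignore_index=True)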