Setup: I have a pre-processed dataset on an MS SQL Server with about 500,000,000 rows and 20 columns, one of which is a rather long text column (varchar(1300)); the table takes up about 35 GB of data space in the SQL database. I'm working on the physical machine where the MS SQL Server is running, so no network traffic is involved, and the machine has 128 GB of RAM, of which MS SQL Server is allowed to use at most 40 GB. I want to import the dataset into Python for further processing. Assume some deep learning experimentation; this matters because I need to be able to transfer the text column as is.
Anecdote: For testing the import code, I used a small subsample of about 700,000 rows. This takes about 1 minute to run, Python climbs to about 700 MB of RAM usage, and saving the resulting variable to the filesystem produces a file of about 250 MB. Extrapolating (the full dataset is roughly 700 times larger), importing everything should take about 700 minutes and produce a file of about 175 GB. That is quite a lot, especially compared to, say, copying the full 31 GB table within SQL, which takes a few minutes at most. I let the full import run for a day to see what would happen, to no avail.
Alternatives: I tried skipping pandas and sqlalchemy and using pyodbc directly, which led me to believe that the problem lies in how pyodbc handles the data import: it stores the queried data in a rows object, which I only managed to read row by row in a loop, and that seems very inefficient to me (a sketch of that attempt follows this paragraph). I don't know whether pandas and sqlalchemy handle this differently under the hood. I also tried not importing the full dataset with a single select statement but splitting it up into lots of smaller ones, which made the small test dataset take 30 minutes to load instead of 1 minute (a sketch of that variant is at the bottom of the post).
Question: How do I load this large (but, compared to other databases, not-so-large) dataset into Python at all? And there has to be a way to do it efficiently: it should not take significantly longer than copying the full table within SQL, and it should not take significantly more space than the table occupies in the SQL database. I do not understand why the data size blows up so much during the process. The solution should not require extracting the table to any medium other than Python first (i.e. no .csv files or the like), though the use of other Python packages is fine.
Code: the pandas + sqlalchemy version I am currently using:

import pandas as pd
import sqlalchemy


def load_data():
    # pull the whole table into a single DataFrame and sort it
    query = "select * from data.table"
    engine = sqlalchemy.create_engine(
        'mssql+pyodbc://server/database'
        '?driver=SQL+Server+Native+Client+11.0'
        '&trusted_connection=yes'
    )
    dat = pd.read_sql(query, engine)
    dat = dat.sort_values(['id', 'date'])
    return dat
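For completeness, the "splitting it up into lots of smaller selects" attempt mentioned under Alternatives looked roughly like this; the pagination scheme and the batch size are illustrative, not my exact code:

import pandas as pd
import sqlalchemy

engine = sqlalchemy.create_engine(
    'mssql+pyodbc://server/database'
    '?driver=SQL+Server+Native+Client+11.0&trusted_connection=yes'
)

chunk_size = 100_000  # illustrative batch size
offset = 0
chunks = []
while True:
    # SQL Server pagination: ORDER BY is required for OFFSET ... FETCH
    query = (
        "select * from data.table order by id "
        f"offset {offset} rows fetch next {chunk_size} rows only"
    )
    chunk = pd.read_sql(query, engine)
    if chunk.empty:
        break
    chunks.append(chunk)
    offset += chunk_size

dat = pd.concat(chunks, ignore_index=True)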