
I am trying to use pandas' to_sql method to upload multiple csv files to their respective tables in a SQL Server database by looping through them.

    import pandas as pd

    # engine is a SQLAlchemy engine created elsewhere; the connection itself works
    fileLoc = r'C:\Users\hcole\Downloads\stats.csv\\'
    dfArray = ['file1', 'file2', 'file3', 'file4']

    for name in dfArray:
        df = pd.read_csv(fileLoc + name + '.csv')
        df.columns = df.columns.str.replace(' ', '')
        df.to_sql(name, engine, if_exists='append', index=False)

My connection string and database connection are working fine; I make it through the first few (and small) files just fine. But as soon as I hit file4, which contains ~135k rows, it takes nearly an hour to upload all of the data to the database. I've tried downgrading to pandas version 0.22 after reading the documentation on the "chunksize" argument of to_sql, but that hasn't sped up the process.
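For reference, the chunksize experiments looked roughly like the following; the exact value shown is just one example of the sizes I tried:

    # passing chunksize so pandas batches the INSERT statements
    # (1000 is only an example value)
    df.to_sql(name, engine, if_exists='append', index=False, chunksize=1000)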

Any tips on how to improve the speed would be appreciated. Thank you.

hstan4

2 Answers


Try stacking all the data into a single DataFrame before even trying to write to SQL.

Time this. If it takes anywhere near that hour, the input data is just big.

If it doesn't, it's much more likely that you have a less-than-stellar connection to your database.
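A minimal sketch of that test, assuming the fileLoc and dfArray from your question; nothing is written to SQL here, only the read-and-stack step is timed:

    import time
    import pandas as pd

    start = time.time()

    # read every csv once and stack them into a single DataFrame,
    # without touching the database at all
    frames = [pd.read_csv(fileLoc + name + '.csv') for name in dfArray]
    combined = pd.concat(frames, ignore_index=True)

    print('reading + stacking took', time.time() - start, 'seconds')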

squirtlekip
  • I do have all of the separate csv files loaded into the same dataframe, so the to_sql statement knows which table to add each to. However, I'm connecting to SQL Server on an Azure database server, if that adds any additional context or concerns. – hstan4 Oct 02 '18 at 16:41

Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC

Since SQLAlchemy 1.3.0, released 2019-03-04, SQLAlchemy supports engine = create_engine(sqlalchemy_url, fast_executemany=True)
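A sketch of how that looks against SQL Server; the connection string below is a placeholder and needs your own server, database, credentials, and ODBC driver name:

    from sqlalchemy import create_engine

    # placeholder connection string for the mssql+pyodbc dialect
    sqlalchemy_url = (
        'mssql+pyodbc://user:password@yourserver.database.windows.net/yourdb'
        '?driver=ODBC+Driver+17+for+SQL+Server'
    )

    # fast_executemany tells pyODBC to send the parameterized INSERTs in batches,
    # which usually makes DataFrame.to_sql far faster against SQL Server
    engine = create_engine(sqlalchemy_url, fast_executemany=True)

    df.to_sql(name, engine, if_exists='append', index=False)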

Nuno Silva