9

I have a 1,000,000 x 50 Pandas DataFrame that I am currently writing to a SQL table using:

df.to_sql('my_table', con, index=False)

It takes an incredibly long time. I've seen various explanations about how to speed up this process online, but none of them seem to work for MSSQL.

  1. If I try the method in:

    Bulk Insert A Pandas DataFrame Using SQLAlchemy

    then I get a `no attribute copy_from` error.

  2. If I try the multithreading method from:

    http://techyoubaji.blogspot.com/2015/10/speed-up-pandas-tosql-with.html

    then I get a `QueuePool limit of size 5 overflow 10 reached, connection timed out` error.

Is there any easy way to speed up to_sql() to an MSSQL table? Either via BULK COPY or some other method, but entirely from within Python code?


3 Answers

6

I've used ctds to do a bulk insert, and it's a lot faster with SQL Server. In the example below, df is the pandas DataFrame, and its column order matches the schema of the target table in mydb.

import ctds

conn = ctds.connect('server', user='user', password='password', database='mydb')
conn.bulk_insert('table', df.to_records(index=False).tolist())  # rows as a list of tuples
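
Not in the original answer, but if holding all one million rows in a single batch is a concern, ctds also exposes a batch_size argument on bulk_insert (assuming a recent ctds release; same hypothetical connection as above):

import ctds

conn = ctds.connect('server', user='user', password='password', database='mydb')

# Commit every 100,000 rows rather than sending the whole frame in one batch.
conn.bulk_insert('table', df.to_records(index=False).tolist(), batch_size=100000)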
5

In pandas 0.24 you can use method='multi' with a chunksize of 1000, which is SQL Server's limit on the number of rows per INSERT statement:

chunksize=1000, method='multi'

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method

New in version 0.24.0.

The parameter method controls the SQL insertion clause used. Possible values are:

  • None: Uses standard SQL INSERT clause (one per row).
  • 'multi': Pass multiple values in a single INSERT clause. It uses a special SQL syntax not supported by all backends. This usually provides better performance for analytic databases like Presto and Redshift, but worse performance for traditional SQL backends if the table contains many columns. For more information check the SQLAlchemy documentation.
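
Put together, a minimal call might look like the sketch below. The connection string is a placeholder for your own pyodbc DSN, and the parameter-limit guard is my addition, not part of the answer: pyodbc also caps a statement at 2100 parameters, so for a wide table the effective chunk has to be smaller than 1000 rows.

from sqlalchemy import create_engine

# Placeholder connection string; adjust for your own driver/DSN.
engine = create_engine('mssql+pyodbc://user:password@my_dsn')

# 1000 rows is SQL Server's cap per multi-row INSERT, but pyodbc also caps a
# statement at 2100 parameters, so wide frames need a smaller chunk.
chunk = min(1000, 2100 // len(df.columns) - 1)

df.to_sql('my_table', engine, index=False, if_exists='append',
          chunksize=chunk, method='multi')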

  • I keep getting `PrestoUserError` when I try `to_sql` with a Presto connection. Could you please share any example of Pandas writing to Presto? I obtained the connection to Presto using `prestodb.dbapi.connect` API – Nitin Jan 29 '22 at 11:38
  • It is trying this query first `SELECT name FROM sqlite_master WHERE type='table' AND name=?;` and complains about `;` at the end :-/ – Nitin Jan 29 '22 at 11:40
3

I had the same issue, so I used SQLAlchemy with fast_executemany:

from sqlalchemy import event, create_engine

engine = create_engine('connection_string_with_database')

@event.listens_for(engine, 'before_cursor_execute')
def plugin_bef_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True  # switch pyodbc from executemany to fast_executemany

Always make sure the listener function is defined after the engine is created and before to_sql() is called.

df.to_sql('table', con=engine, if_exists='append', index=False)  # for reference see the pandas to_sql documentation
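
As a side note (this is not part of the answer above): on SQLAlchemy 1.3+ the same pyodbc setting can be passed straight to create_engine, which avoids the event listener entirely. The connection string here is a placeholder:

from sqlalchemy import create_engine

# SQLAlchemy 1.3+ accepts fast_executemany directly for mssql+pyodbc engines.
engine = create_engine('mssql+pyodbc://user:password@my_dsn', fast_executemany=True)
df.to_sql('table', con=engine, if_exists='append', index=False)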
  • adding the decorator will cause this issue: `('HY090', '[HY090] [Microsoft][ODBC Driver Manager] Invalid string or buffer length (0) (SQLBindParameter)')` – Led Jun 13 '19 at 17:02