9

I have a 1,000,000 x 50 Pandas DataFrame that I am currently writing to a SQL table using:

df.to_sql('my_table', con, index=False)

It takes an incredibly long time. I've seen various explanations about how to speed up this process online, but none of them seem to work for MSSQL.

  1. If I try the method in:

    Bulk Insert A Pandas DataFrame Using SQLAlchemy

    then I get a `no attribute copy_from` error.

  2. If I try the multithreading method from:

    http://techyoubaji.blogspot.com/2015/10/speed-up-pandas-tosql-with.html

    then I get a `QueuePool limit of size 5 overflow 10 reached, connection timed out` error.

Is there any easy way to speed up to_sql() to an MSSQL table? Either via BULK COPY or some other method, but entirely from within Python code?


3 Answers

6

I've used ctds to do a bulk insert, and it's a lot faster with SQL Server. In the example below, df is the pandas DataFrame, and its column order matches the schema of the target table in mydb.

import ctds

conn = ctds.connect('server', user='user', password='password', database='mydb')
conn.bulk_insert('table', df.to_records(index=False).tolist())  # rows as a list of tuples
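
Not in the original answer, but if holding all one million rows in a single batch is a concern, ctds also exposes a batch_size argument on bulk_insert (assuming a recent ctds release; same hypothetical connection as above):

import ctds

conn = ctds.connect('server', user='user', password='password', database='mydb')

# Commit every 100,000 rows rather than sending the whole frame in one batch.
conn.bulk_insert('table', df.to_records(index=False).tolist(), batch_size=100000)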
5

In pandas 0.24 you can use method='multi' with a chunksize of 1000, which is SQL Server's limit on the number of rows per INSERT statement:

chunksize=1000, method='multi'

https://pandas.pydata.org/pandas-docs/stable/user_guide/io.html#io-sql-method

New in version 0.24.0.

The parameter method controls the SQL insertion clause used. Possible values are:

  • None: Uses standard SQL INSERT clause (one per row).
  • 'multi': Pass multiple values in a single INSERT clause. It uses a special SQL syntax not supported by all backends. This usually provides better performance for analytic databases like Presto and Redshift, but worse performance for traditional SQL backends if the table contains many columns. For more information check the SQLAlchemy documentation.
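
Put together, a minimal call might look like the sketch below. The connection string is a placeholder for your own pyodbc DSN, and the parameter-limit guard is my addition, not part of the answer: pyodbc also caps a statement at 2100 parameters, so for a wide table the effective chunk has to be smaller than 1000 rows.

from sqlalchemy import create_engine

# Placeholder connection string; adjust for your own driver/DSN.
engine = create_engine('mssql+pyodbc://user:password@my_dsn')

# 1000 rows is SQL Server's cap per multi-row INSERT, but pyodbc also caps a
# statement at 2100 parameters, so wide frames need a smaller chunk.
chunk = min(1000, 2100 // len(df.columns) - 1)

df.to_sql('my_table', engine, index=False, if_exists='append',
          chunksize=chunk, method='multi')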

  • I keep getting `PrestoUserError` when I try `to_sql` with a Presto connection. Could you please share any example of Pandas writing to Presto? I obtained the connection to Presto using `prestodb.dbapi.connect` API – Nitin Jan 29 '22 at 11:38
  • It is trying this query first `SELECT name FROM sqlite_master WHERE type='table' AND name=?;` and complains about `;` at the end :-/ – Nitin Jan 29 '22 at 11:40
3

I had the same issue, so I used SQLAlchemy with fast_executemany:

from sqlalchemy import event, create_engine

engine = create_engine('connection_string_with_database')

@event.listens_for(engine, 'before_cursor_execute')
def plugin_bef_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        cursor.fast_executemany = True  # switch pyodbc from executemany to fast_executemany

Always make sure the listener function is defined after the engine is created and before to_sql() is called.

df.to_sql('table', con=engine, if_exists='append', index=False)  # for reference see the pandas to_sql documentation
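
As a side note (this is not part of the answer above): on SQLAlchemy 1.3+ the same pyodbc setting can be passed straight to create_engine, which avoids the event listener entirely. The connection string here is a placeholder:

from sqlalchemy import create_engine

# SQLAlchemy 1.3+ accepts fast_executemany directly for mssql+pyodbc engines.
engine = create_engine('mssql+pyodbc://user:password@my_dsn', fast_executemany=True)
df.to_sql('table', con=engine, if_exists='append', index=False)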
  • adding the decorator will cause this issue: `('HY090', '[HY090] [Microsoft][ODBC Driver Manager] Invalid string or buffer length (0) (SQLBindParameter)')` – Led Jun 13 '19 at 17:02