
With this table:

CREATE TABLE test_insert (
    col1 INT,
    col2 VARCHAR(10),
    col3 DATE
)

the following code takes 40 seconds to run:

import pyodbc

from datetime import date


conn = pyodbc.connect('DRIVER={SQL Server Native Client 10.0};'
    'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy')

rows = []
row = [1, 'abc', date.today()]
for i in range(10000):
    rows.append(row)

cursor = conn.cursor()
cursor.executemany('INSERT INTO test_insert VALUES (?, ?, ?)', rows)

conn.commit()

The equivalent code with psycopg2 only takes 3 seconds. I don't think mssql is that much slower than postgresql. Any idea on how to improve the bulk insert speed when using pyodbc?

EDIT: Adding some notes following ghoerz's discovery (see his answer below).

In pyodbc, the flow of executemany is:

  • prepare statement
  • loop for each set of parameters
    • bind the set of parameters
    • execute

In ceODBC, the flow of executemany is:

  • prepare statement
  • bind all parameters
  • execute
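
As a rough sketch (not the library internals, just the observable effect), pyodbc's default executemany behaves like a per-row execute loop, whereas array binding sends every parameter set in one batch. Reusing conn and rows from the code above:

# Effectively what pyodbc's default executemany does: one execution per row.
cursor = conn.cursor()
for row in rows:
    cursor.execute('INSERT INTO test_insert VALUES (?, ?, ?)', row)

# Array binding, available in pyodbc 4.0.19+ via fast_executemany (see the answers below):
# all parameter sets are bound and sent in a single batch.
cursor.fast_executemany = True
cursor.executemany('INSERT INTO test_insert VALUES (?, ?, ?)', rows)
conn.commit()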
sayap
  • Try using an explicit transaction. – Lasse V. Karlsen Apr 17 '11 at 13:55
  • Reading http://stackoverflow.com/questions/1063770/in-python-using-pyodbc-how-do-you-perform-transactions, it doesn't seem like pyodbc has support for explicit transactions. – sayap Apr 17 '11 at 14:26
  • That's not the way I read it. You turn off auto-commit, and have to explicitly call rollback or commit. However, I have no idea if it makes a difference or not, but it would be something I would try myself. – Lasse V. Karlsen Apr 17 '11 at 15:31
  • What you described is exactly what my code does. Autocommit is off by default. – sayap Apr 17 '11 at 15:32
  • I don't see any reason for this to be slow. What version of SQL Server, and is the installation a standard installation, i.e. no funny configs etc? Like running databases from USB etc? You can also try and attach SQL Profiler to the db and see if you can spot where the inefficiency comes from, but your equivalent code in c# executes in less than 3 seconds on my pc. – Ryk Apr 18 '11 at 03:10
  • Ryk, I think this problem is specific to the Python bindings for mssql, and Lasse had the right suggestion, i.e. to get pyodbc to wrap the 10000 INSERTs in one transaction. Unfortunately, I couldn't find a way to do that, and have since changed my code to use BULK INSERT instead. – sayap Apr 18 '11 at 03:21

7 Answers


I was having a similar issue with pyODBC inserting into a SQL Server 2008 DB using executemany(). When I ran a profiler trace on the SQL side, pyODBC was creating a connection, preparing the parametrized insert statement, and executing it for one row. Then it would unprepare the statement, and close the connection. It then repeated this process for each row.

I wasn't able to find any solution in pyODBC that didn't do this. I ended up switching to ceODBC for connecting to SQL Server, and it used the parametrized statements correctly.

ghoerz
  • Thanks for confirmation and tips. I have filed this as http://code.google.com/p/pyodbc/issues/detail?id=250 – sayap Mar 29 '12 at 23:33

Trying to insert +2M rows into MSSQL using pyodbc was taking an absurdly long amount of time compared to bulk operations in Postgres (psycopg2) and Oracle (cx_Oracle). I did not have the privileges to use the BULK INSERT operation, but was able to solve the problem with the method below.

Many solutions correctly suggested fast_executemany; however, there are some tricks to using it correctly. First, I noticed that pyodbc was committing after each row when autocommit was set to True in the connect method, so it must be set to False. I also observed a non-linear slowdown when inserting more than ~20k rows at a time, i.e. inserting 10k rows was sub-second, but 50k took upwards of 20s. I assume the transaction log gets quite large and slows the whole thing down. Therefore, you must chunk your inserts and commit after each chunk. I found 5k rows per chunk delivered good performance, but this will obviously depend on many factors (the data, the machine, DB config, etc.).

import pyodbc

CHUNK_SIZE = 5000

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):  # on Python 2, use xrange instead of range
        yield l[i:i + n]

mssql_conn = pyodbc.connect(driver='{ODBC Driver 17 for SQL Server}',
                            server='<SERVER,PORT>',
                            timeout=1,
                            port=<PORT>,
                            uid=<UNAME>, 
                            pwd=<PWD>,
                            TDS_Version=7.2,
                            autocommit=False) #IMPORTANT

mssql_cur = mssql_conn.cursor()
mssql_cur.fast_executemany = True #IMPORTANT

params = [tuple(x) for x in df.values]  # df is a pandas DataFrame holding the rows to insert

stmt = "truncate table <THE TABLE>"
mssql_cur.execute(stmt)
mssql_conn.commit()

stmt = """
INSERT INTO <THE TABLE> (field1...fieldn) VALUES (?,...,?)
"""
for chunk in chunks(params, CHUNK_SIZE): #IMPORTANT
    mssql_cur.executemany(stmt, chunk)
    mssql_conn.commit()
Dave Lyndon
  • Thanks a bunch man! Saved may day.... – Niels Broertjes Mar 25 '22 at 10:35
  • Just flagging that for pyodbc.connect, autocommit=False by default (I think since 2020 or even earlier), and echoing your answer that 1. fast_executemany = True helps and 2. using 5-25k chunks also appears, anecdotally, to help. – 10mjg Oct 03 '22 at 18:57

Tried both ceODBC and mxODBC and both were also painfully slow. Ended up going with an adodb connection with help from http://www.ecp.cc/pyado.html. Total run time improved by a factor of 6!

import win32com.client

# dbDIR, dbOut, tblName, fldLST and values are assumed to be defined elsewhere.
comConn = win32com.client.Dispatch(r'ADODB.Connection')
DSN = 'PROVIDER=Microsoft.Jet.OLEDB.4.0;DATA SOURCE=%s%s' % (dbDIR, dbOut)
comConn.Open(DSN)

rs = win32com.client.Dispatch(r'ADODB.Recordset')
rs.Open('[' + tblName + ']', comConn, 1, 3)  # 1 = adOpenKeyset, 3 = adLockOptimistic

for f in values:
    rs.AddNew(fldLST, f)

rs.Update()
Jay

pyodbc 4.0.19 added a Cursor#fast_executemany option to help address this issue. See this answer for details.
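
A minimal sketch of turning it on, reusing the table and placeholder credentials from the question (the driver name is whatever ODBC driver you have installed):

import pyodbc

conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy')
cursor = conn.cursor()
cursor.fast_executemany = True  # pyodbc 4.0.19+: parameters are array-bound and sent in one batch
cursor.executemany('INSERT INTO test_insert VALUES (?, ?, ?)', rows)  # rows as built in the question
conn.commit()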

Gord Thompson

I wrote the data to a text file and then invoked the BCP utility. Much, much quicker: from about 20 to 30 minutes down to a few seconds.
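
A sketch of that approach, assuming a comma-delimited dump of the rows from the question and the bcp command-line utility on the PATH (server name, credentials and file path are placeholders):

import csv
import subprocess

# Dump the rows to a delimited text file.
with open('rows.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)

# Load it with bcp: -c = character mode, -t, = comma field terminator.
subprocess.check_call([
    'bcp', 'test.dbo.test_insert', 'in', 'rows.csv',
    '-S', 'localhost', '-U', 'xxx', '-P', 'yyy',
    '-c', '-t,',
])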

MikeP

I was using pypyODBC w/ Python 3.5 and Microsoft SQL Server Management Studio. A particular table (~70K rows w/ 40 vars) was taking 112 seconds to INSERT using the .executemany() method with pypyodbc.

With ceODBC it took 4 seconds.

Nirvan

I found it insanely slow while uploading ~1.6 million rows (from ~195 csv files) to a Microsoft SQL Server database table. I did lots of research over the net, but nothing worked for me. After reading the article Why do the queries slow down after a few minutes, when trying to build a DB?, I looked into how the table was structured. My table had 15 columns, and all of them had indexes. I deleted all of them, ran my Python script, and it was a damn miracle: my upload time was reduced by 98%.
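
A less destructive variant of the same idea (my own sketch, not what this answer did) is to disable the nonclustered indexes for the duration of the load and rebuild them afterwards. Here the table name dbo.my_table is a placeholder and conn is a pyodbc connection:

cursor = conn.cursor()

# Find the enabled nonclustered indexes on the target table.
cursor.execute("""
    SELECT name FROM sys.indexes
    WHERE object_id = OBJECT_ID('dbo.my_table')
      AND type_desc = 'NONCLUSTERED' AND is_disabled = 0
""")
index_names = [row.name for row in cursor.fetchall()]

# Disable them so the bulk insert doesn't have to maintain them row by row.
for name in index_names:
    cursor.execute("ALTER INDEX [%s] ON dbo.my_table DISABLE" % name)
conn.commit()

# ... run the bulk insert here ...

# Rebuild (and thereby re-enable) the indexes once the load is done.
for name in index_names:
    cursor.execute("ALTER INDEX [%s] ON dbo.my_table REBUILD" % name)
conn.commit()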

Deepak.K