
With this table:

CREATE TABLE test_insert (
    col1 INT,
    col2 VARCHAR(10),
    col3 DATE
)

the following code takes 40 seconds to run:

import pyodbc

from datetime import date


conn = pyodbc.connect('DRIVER={SQL Server Native Client 10.0};'
    'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy')

rows = []
row = [1, 'abc', date.today()]
for i in range(10000):
    rows.append(row)

cursor = conn.cursor()
cursor.executemany('INSERT INTO test_insert VALUES (?, ?, ?)', rows)

conn.commit()

The equivalent code with psycopg2 only takes 3 seconds. I don't think mssql is that much slower than postgresql. Any idea on how to improve the bulk insert speed when using pyodbc?

EDIT: Adding some notes following ghoerz's discovery (see his answer below).

In pyodbc, the flow of executemany is:

  • prepare statement
  • loop for each set of parameters
    • bind the set of parameters
    • execute

In ceODBC, the flow of executemany is:

  • prepare statement
  • bind all parameters
  • execute
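
As a rough sketch (not the library internals, just the observable effect), pyodbc's default executemany behaves like a per-row execute loop, whereas array binding sends every parameter set in one batch. Reusing conn and rows from the code above:

# Effectively what pyodbc's default executemany does: one execution per row.
cursor = conn.cursor()
for row in rows:
    cursor.execute('INSERT INTO test_insert VALUES (?, ?, ?)', row)

# Array binding, available in pyodbc 4.0.19+ via fast_executemany (see the answers below):
# all parameter sets are bound and sent in a single batch.
cursor.fast_executemany = True
cursor.executemany('INSERT INTO test_insert VALUES (?, ?, ?)', rows)
conn.commit()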
sayap
  • Try using an explicit transaction. – Lasse V. Karlsen Apr 17 '11 at 13:55
  • Reading http://stackoverflow.com/questions/1063770/in-python-using-pyodbc-how-do-you-perform-transactions, it doesn't seem like pyodbc has support for explicit transactions. – sayap Apr 17 '11 at 14:26
  • That's not the way I read it. You turn off auto-commit, and have to explicitly call rollback or commit. However, I have no idea if it makes a difference or not, but it would be something I would try myself. – Lasse V. Karlsen Apr 17 '11 at 15:31
  • What you described is exactly what my code does. Autocommit is off by default. – sayap Apr 17 '11 at 15:32
  • I don't see any reason for this to be slow. What version of SQL Server, and is the installation a standard installation, i.e. no funny configs etc? Like running databases from USB etc? You can also try and attach SQL Profiler to the db and see if you can spot where the inefficiency comes from, but your equivalent code in c# executes in less than 3 seconds on my pc. – Ryk Apr 18 '11 at 03:10
  • Ryk, I think this problem is specific to the Python bindings for mssql, and Lasse had the right suggestion, i.e. to get pyodbc to wrap the 10000 INSERTs in one transaction. Unfortunately, I couldn't find a way to do that, and have since changed my code to use BULK INSERT instead. – sayap Apr 18 '11 at 03:21

7 Answers


I was having a similar issue with pyODBC inserting into a SQL Server 2008 DB using executemany(). When I ran a profiler trace on the SQL side, pyODBC was creating a connection, preparing the parametrized insert statement, and executing it for one row. Then it would unprepare the statement, and close the connection. It then repeated this process for each row.

I wasn't able to find any solution in pyODBC that didn't do this. I ended up switching to ceODBC for connecting to SQL Server, and it used the parametrized statements correctly.

ghoerz
  • Thanks for confirmation and tips. I have filed this as http://code.google.com/p/pyodbc/issues/detail?id=250 – sayap Mar 29 '12 at 23:33

Trying to insert +2M rows into MSSQL using pyodbc was taking an absurdly long amount of time compared to bulk operations in Postgres (psycopg2) and Oracle (cx_Oracle). I did not have the privileges to use the BULK INSERT operation, but was able to solve the problem with the method below.

Many solutions correctly suggested fast_executemany; however, there are some tricks to using it correctly. First, I noticed that pyodbc was committing after each row when autocommit was set to True in the connect method, so it must be set to False. I also observed a non-linear slowdown when inserting more than ~20k rows at a time, i.e. inserting 10k rows was sub-second, but 50k took upwards of 20s. I assume the transaction log gets quite large and slows the whole thing down. Therefore, you must chunk your inserts and commit after each chunk. I found 5k rows per chunk delivered good performance, but this will obviously depend on many factors (the data, the machine, DB config, etc.).

import pyodbc

CHUNK_SIZE = 5000

def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):  # on Python 2, use xrange instead of range
        yield l[i:i + n]

mssql_conn = pyodbc.connect(driver='{ODBC Driver 17 for SQL Server}',
                            server='<SERVER,PORT>',
                            timeout=1,
                            port=<PORT>,
                            uid=<UNAME>, 
                            pwd=<PWD>,
                            TDS_Version=7.2,
                            autocommit=False) #IMPORTANT

mssql_cur = mssql_conn.cursor()
mssql_cur.fast_executemany = True #IMPORTANT

params = [tuple(x) for x in df.values]  # df is a pandas DataFrame holding the rows to insert

stmt = "truncate table <THE TABLE>"
mssql_cur.execute(stmt)
mssql_conn.commit()

stmt = """
INSERT INTO <THE TABLE> (field1...fieldn) VALUES (?,...,?)
"""
for chunk in chunks(params, CHUNK_SIZE): #IMPORTANT
    mssql_cur.executemany(stmt, chunk)
    mssql_conn.commit()
Dave Lyndon
  • Thanks a bunch man! Saved may day.... – Niels Broertjes Mar 25 '22 at 10:35
  • Just flagging that for pyodbc.connect, autocommit=False by default (I think since 2020 or even earlier), and echoing your answer that 1. fast_executemany = True helps and 2. using 5-25k chunks also appears, anecdotally, to help. – 10mjg Oct 03 '22 at 18:57

Tried both ceODBC and mxODBC and both were also painfully slow. Ended up going with an adodb connection with help from http://www.ecp.cc/pyado.html. Total run time improved by a factor of 6!

import win32com.client

# dbDIR, dbOut, tblName, fldLST and values are assumed to be defined elsewhere.
comConn = win32com.client.Dispatch(r'ADODB.Connection')
DSN = 'PROVIDER=Microsoft.Jet.OLEDB.4.0;DATA SOURCE=%s%s' % (dbDIR, dbOut)
comConn.Open(DSN)

rs = win32com.client.Dispatch(r'ADODB.Recordset')
rs.Open('[' + tblName + ']', comConn, 1, 3)  # 1 = adOpenKeyset, 3 = adLockOptimistic

for f in values:
    rs.AddNew(fldLST, f)

rs.Update()
Jay

pyodbc 4.0.19 added a Cursor#fast_executemany option to help address this issue. See this answer for details.
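
A minimal sketch of turning it on, reusing the table and placeholder credentials from the question (the driver name is whatever ODBC driver you have installed):

import pyodbc

conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=localhost;DATABASE=test;UID=xxx;PWD=yyy')
cursor = conn.cursor()
cursor.fast_executemany = True  # pyodbc 4.0.19+: parameters are array-bound and sent in one batch
cursor.executemany('INSERT INTO test_insert VALUES (?, ?, ?)', rows)  # rows as built in the question
conn.commit()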

Gord Thompson

I wrote the data to a text file and then invoked the BCP utility. Much, much quicker: from about 20 to 30 minutes down to a few seconds.
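
A sketch of that approach, assuming a comma-delimited dump of the rows from the question and the bcp command-line utility on the PATH (server name, credentials and file path are placeholders):

import csv
import subprocess

# Dump the rows to a delimited text file.
with open('rows.csv', 'w', newline='') as f:
    csv.writer(f).writerows(rows)

# Load it with bcp: -c = character mode, -t, = comma field terminator.
subprocess.check_call([
    'bcp', 'test.dbo.test_insert', 'in', 'rows.csv',
    '-S', 'localhost', '-U', 'xxx', '-P', 'yyy',
    '-c', '-t,',
])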

MikeP

I was using pypyODBC w/ Python 3.5 and Microsoft SQL Server Management Studio. A particular table (~70K rows w/ 40 vars) was taking 112 seconds to INSERT using the .executemany() method with pypyodbc.

With ceODBC it took 4 seconds.

Nirvan

I found it insanely slow while uploading ~1.6 million rows (from ~195 csv files) to a Microsoft SQL Server database table. I did lots of research over the net, but nothing worked for me. After reading the article Why do the queries slow down after a few minutes, when trying to build a DB?, I looked into how the table was structured. My table had 15 columns, and all of them had indexes. I deleted all of them, ran my Python script, and it was a damn miracle: my upload time was reduced by 98%.
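
A less destructive variant of the same idea (my own sketch, not what this answer did) is to disable the nonclustered indexes for the duration of the load and rebuild them afterwards. Here the table name dbo.my_table is a placeholder and conn is a pyodbc connection:

cursor = conn.cursor()

# Find the enabled nonclustered indexes on the target table.
cursor.execute("""
    SELECT name FROM sys.indexes
    WHERE object_id = OBJECT_ID('dbo.my_table')
      AND type_desc = 'NONCLUSTERED' AND is_disabled = 0
""")
index_names = [row.name for row in cursor.fetchall()]

# Disable them so the bulk insert doesn't have to maintain them row by row.
for name in index_names:
    cursor.execute("ALTER INDEX [%s] ON dbo.my_table DISABLE" % name)
conn.commit()

# ... run the bulk insert here ...

# Rebuild (and thereby re-enable) the indexes once the load is done.
for name in index_names:
    cursor.execute("ALTER INDEX [%s] ON dbo.my_table REBUILD" % name)
conn.commit()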

Deepak.K