
I want to insert a Pandas DataFrame as a whole into a table in an MS SQL Server database. BULK INSERT is not allowed for common users like myself. I am using pyodbc to connect to my database, and Pandas 0.13.1. I read somewhere that the to_sql method is only available from version 0.14 onward, so it is not an option for my DataFrame. Therefore I used an iterator. My DataFrame has 2 columns: Col1 and Col2.

My code works and looks like this:

from pyodbc import connect
import pandas as pd

df = pd.read_csv('PathToMyCSVfile', sep=';', header=0)

cnxn = connect(DRIVER='{SQL Server}', SERVER='MyServer', DATABASE='MyDatabase')
cursor = cnxn.cursor()

for index, row in df.iterrows():
    # one INSERT (and one commit) per row of the DataFrame
    cursor.execute("INSERT INTO MySchema.MyTable VALUES (?,?)", df['Col1'][index], df['Col2'][index])
    cnxn.commit()

As said, the above code works, but it is slow... What can I do to speed things up?

MA53QXR
  • Possible duplicate of [basic pyodbc bulk insert](https://stackoverflow.com/questions/37008848/basic-pyodbc-bulk-insert) – Ignacio Vergara Kausel Oct 11 '17 at 09:26
  • @IgnacioVergaraKausel I don't think that it is a duplicate. BULK INSERT is not allowed on my server. Also the possible duplicate mentions executemany as being a wrapper and not a different approach, thus I think it will not be (much) faster than the execute I used. – MA53QXR Oct 11 '17 at 11:16
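
For reference, the executemany variant mentioned in the comment above would look roughly like this (a minimal sketch reusing the df, cursor and cnxn objects from the question). As the comment notes, executemany in pyodbc of that era is essentially a wrapper around execute, still sending one INSERT per row, so it mainly saves Python-level loop overhead rather than server round-trips:

params = [tuple(row) for row in df.itertuples(index=False)]  # one tuple per row: (Col1, Col2)
cursor.executemany("INSERT INTO MySchema.MyTable VALUES (?,?)", params)
cnxn.commit()  # commit once, after all rows have been sent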

1 Answer


The bottleneck you face is that your code sends an INSERT statement for each row in the DataFrame. That is, for a sample data file

id;txt
1;alpha
2;bravo
3;charlie
4;delta
5;echo
6;foxtrot
7;golf

you would need seven (7) round-trips to the server to send the equivalent of

INSERT INTO MySchema.MyTable VALUES (1,'alpha')
INSERT INTO MySchema.MyTable VALUES (2,'bravo')
INSERT INTO MySchema.MyTable VALUES (3,'charlie')
...
INSERT INTO MySchema.MyTable VALUES (7,'golf')

You could speed that up significantly by using a Table Value Constructor to do the same thing in one round-trip:

INSERT INTO MySchema.MyTable VALUES (1,'alpha'),(2,'bravo'),(3,'charlie'), ... ,(7,'golf')

The following code does just that. When I tested it using a file with 5000 rows, running it with rows_per_batch=1000 (the maximum) was about 100 times faster than with rows_per_batch=1 (the equivalent of your current approach).

import numpy
import pandas as pd
import pyodbc
import time


class MyDfInsert:
    def __init__(self, cnxn, sql_stub, data_frame, rows_per_batch=1000):
        # NB: hard limit is 1000 for SQL Server table value constructor
        self._rows_per_batch = 1000 if rows_per_batch > 1000 else rows_per_batch

        self._cnxn = cnxn
        self._sql_stub = sql_stub
        self._num_columns = None
        self._row_placeholders = None
        self._num_rows_previous = None
        self._all_placeholders = None
        self._sql = None

        row_count = 0
        param_list = list()
        for df_row in data_frame.itertuples():
            param_list.append(tuple(df_row[1:]))  # omit zero-based row index
            row_count += 1
            if row_count >= self._rows_per_batch:
                self._send_insert(param_list)  # send a full batch
                row_count = 0
                param_list = list()
        self._send_insert(param_list)  # send any remaining rows

    def _send_insert(self, param_list):
        if len(param_list) > 0:
            if self._num_columns is None:
                # print('[DEBUG] (building items that depend on the number of columns ...)')
                # this only happens once
                self._num_columns = len(param_list[0])
                self._row_placeholders = ','.join(['?' for x in range(self._num_columns)])
                # e.g. '?,?'
            num_rows = len(param_list)
            if num_rows != self._num_rows_previous:
                # print('[DEBUG] (building items that depend on the number of rows ...)')
                self._all_placeholders = '({})'.format('),('.join([self._row_placeholders for x in range(num_rows)]))
                # e.g. '(?,?),(?,?),(?,?)'
                self._sql = f'{self._sql_stub} VALUES {self._all_placeholders}'
                self._num_rows_previous = num_rows
            params = [int(element) if isinstance(element, numpy.int64) else element
                      for row_tup in param_list for element in row_tup]
            # print('[DEBUG]    sql: ' + repr(self._sql))
            # print('[DEBUG] params: ' + repr(params))
            crsr = self._cnxn.cursor()
            crsr.execute(self._sql, params)


if __name__ == '__main__':
    conn_str = (
        'DRIVER=ODBC Driver 11 for SQL Server;'
        'SERVER=192.168.1.134,49242;'
        'Trusted_Connection=yes;'
    )
    cnxn = pyodbc.connect(conn_str, autocommit=True)
    crsr = cnxn.cursor()
    crsr.execute("CREATE TABLE #tmp (id INT PRIMARY KEY, txt NVARCHAR(50))")

    df = pd.read_csv(r'C:\Users\Gord\Desktop\Query1.txt', sep=';', header=0)

    t0 = time.time()

    MyDfInsert(cnxn, "INSERT INTO #tmp (id, txt)", df, rows_per_batch=1000)

    print()
    print(f'Inserts completed in {time.time() - t0:.2f} seconds.')

    cnxn.close()
Gord Thompson
  • Thanks @Gord Thompson, I copied the script and adjusted things like the driver, server and CSV file. Also I replaced the f-string line with `self._sql = '{0} VALUES {1}'.format(self._sql_stub, self._all_placeholders)` because I use Python 2.7. That should have worked, but didn't. – MA53QXR Oct 12 '17 at 13:41
  • A pyodbc error 42000 was raised, indicating that the number of parameters exceeded the maximum number of parameters (2,100). You mentioned that the hard limit was 1,000 for the table value constructor. Your CSV file has 2 columns, so I guessed that the 1,000 came from 2,100 divided by 2 columns. I divided 2,100 by my 8 columns = 262.50, replaced your 1,000 with 250, and it worked like a charm! Thanks – MA53QXR Oct 12 '17 at 13:46
  • 1
    And the main objective was speeding things up. That goal was accomplished: my method took 1,074 seconds, your method took 32 seconds. A big increase in speed! – MA53QXR Oct 12 '17 at 13:47
  • @MA53QXR re: *"I guessed that the 1,000 came from 2,100 divided by 2 columns"* - No, a T-SQL Table Value Constructor is limited to 1000 rows regardless of the number of columns. The limit on the number of parameters is a different restriction, most likely due to the SQL statement ultimately being executed by a system stored procedure (probably `sp_prepexec`). – Gord Thompson Oct 12 '17 at 14:18
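
Putting the two limits from these comments together: a batch can hold at most 1000 rows (the Table Value Constructor limit) and a parameterized statement can carry at most 2,100 parameters, so a safe batch size depends on the number of columns. A minimal sketch of that arithmetic (the helper name is illustrative, not part of the answer):

def safe_rows_per_batch(data_frame, max_params=2100, max_tvc_rows=1000):
    # rows per batch are limited both by the 1000-row table value constructor cap
    # and by the 2,100-parameter cap divided across the columns in each row
    num_columns = len(data_frame.columns)
    return min(max_tvc_rows, max_params // num_columns)

# with the 2-column example above this returns 1000; with the asker's 8-column
# file it returns 262, so their choice of 250 leaves some headroom under both limits
MyDfInsert(cnxn, "INSERT INTO #tmp (id, txt)", df, rows_per_batch=safe_rows_per_batch(df))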