25

In a Python script, I need to run a query on one datasource and insert each row from that query into a table on a different datasource. I'd normally do this with a single INSERT ... SELECT statement using a T-SQL linked server join, but I don't have a linked server connection to this particular datasource.

I'm having trouble finding a simple pyodbc example of this. Here's how I'd do it, but I'm guessing that executing an insert statement inside a loop is pretty slow.

result = ds1Cursor.execute(selectSql)

for row in result:
    insertSql = "insert into TableName (Col1, Col2, Col3) values (?, ?, ?)"
    ds2Cursor.execute(insertSql, row[0], row[1], row[2])
    ds2Cursor.commit()

Is there a better bulk way to insert records with pyodbc? Or is this a relatively efficient way to do it anyway? I'm using SQL Server 2012 and the latest pyodbc and Python versions.

Zip184

5 Answers

24

The best way to handle this is to use the pyodbc function executemany.

ds1Cursor.execute(selectSql)
result = ds1Cursor.fetchall()

ds2Cursor.executemany('INSERT INTO [TableName] (Col1, Col2, Col3) VALUES (?, ?, ?)', result)
ds2Cursor.commit()
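
If the source result set is too large to hold in memory comfortably, a variation on the same idea is to page through it with fetchmany. This is only a hedged sketch; the batch size is arbitrary:

ds1Cursor.execute(selectSql)
while True:
    batch = ds1Cursor.fetchmany(10000)  # arbitrary batch size
    if not batch:
        break
    ds2Cursor.executemany('INSERT INTO [TableName] (Col1, Col2, Col3) VALUES (?, ?, ?)', batch)
ds2Cursor.commit()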
LegendaryDude
Just a note: executemany doesn't actually do a true bulk insert. Behind the scenes it still does the inserts one by one; it really is a wrapper that allows data to be sourced more pythonically. This SO post presents a proper bulk insert: http://stackoverflow.com/questions/29638136/how-to-speed-up-with-bulk-insert-to-ms-server-from-python-with-pyodbc-from-csv – casbby Dec 09 '16 at 03:34
I get the following error; can you point me to a solution? https://stackoverflow.com/questions/50597816/python-bulk-insert-gives-typeerror-not-all-arguments-converted-during-string-fo – Ratha May 30 '18 at 06:36
  • 2
    pyodbc version 4.0.19 and later has a `fast_executemany` option that can speed things up considerably. See [this answer](https://stackoverflow.com/a/47057189/2144390) for details. – Gord Thompson Dec 20 '18 at 13:26
I understand that executemany only increases throughput by about 1.5x. Can anyone confirm? – Windstorm1981 Jan 29 '19 at 00:34
is there a way to pass a list of field/column names into the SQL statement? I'm trying to load all data from one table to another. The source table has about 60 fields; I'm hoping I can programmatically get a list of the fields in the source and use that in the SQL statement, instead of typing all the field names in the SQL statement. – franchyze923 Mar 12 '21 at 14:06
22

Here's a function that can do a bulk insert into a SQL Server database.

import pyodbc
import contextlib

def bulk_insert(table_name, file_path):
    # Note: the file path must be accessible from the SQL Server machine itself,
    # and the FORMAT = 'CSV' option requires SQL Server 2017 or later.
    string = "BULK INSERT {} FROM '{}' WITH (FORMAT = 'CSV');"
    with contextlib.closing(pyodbc.connect("MYCONN")) as conn:
        with contextlib.closing(conn.cursor()) as cursor:
            cursor.execute(string.format(table_name, file_path))
        conn.commit()

This definitely works.

UPDATE: I've noticed from the comments, as well as from coding regularly, that pyodbc is better supported than pypyodbc.

NEW UPDATE: removed conn.close(), since the with statement (via contextlib.closing) handles that automatically.
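
For reference, a call might look like the line below; the table name and file path are placeholders, and the file has to be readable by the SQL Server machine itself:

bulk_insert("MYSCHEMA.MYTABLE", r"C:\data\rows.csv")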

Naufal
This is the correct answer, and should be accepted as such. The executemany method is not a substitute for the speed of bulk insert. Of note, if you want to perform a bulk insert from an iterator rather than a file on the SQL Server itself, the ctds driver is an option. https://pypi.python.org/pypi/ctds/ – Kerr Aug 30 '17 at 19:29
  • Only just checked out the link you provided. I think it looks really good. Gonna give it a try. Thanks. – Naufal Oct 12 '17 at 15:02
  • 1
    "Due to updates, it is better to use pypyodbc instead of pyodbc." - That is no longer true. pyodbc is still under active development and is officially supported by Microsoft. Neither of those statements is true for pypyodbc. – Gord Thompson Dec 20 '18 at 13:29
  • 1
    Thanks for pointing that out Gord. I've noticed that pyodbc has improved a lot since I wrote this. – Naufal Dec 20 '18 at 13:59
  • 3
    This requires that your SQL instance have access to this file as it is pulling it in. It is probably preferable to push it to your database by implementing a bulk copy to SQL. https://github.com/Azure/azure-sqldb-spark – Morrolan Feb 14 '19 at 01:03
4

Since the discontinuation of the pymssql library (which seems to be under development again), we started using the cTDS library developed by the smart people at Zillow, and to our surprise it supports FreeTDS bulk insert.

As the name suggests, cTDS is written in C on top of the FreeTDS library, which makes it fast, really fast. IMHO this is the best way to bulk insert into SQL Server, since the ODBC driver does not support bulk insert, and executemany or fast_executemany, as suggested, aren't really bulk insert operations. The BCP tool and T-SQL BULK INSERT have their limitations, since they need the file to be accessible by the SQL Server, which can be a deal breaker in many scenarios.

Below is a naive implementation of bulk inserting a CSV file. Please forgive any bugs; I wrote this from memory without testing.

I don't know why, but for my server, which uses Latin1_General_CI_AS, I needed to wrap the data going into NVarChar columns with ctds.SqlVarChar. I opened an issue about this, but the developers said the naming is correct, so I changed my code to keep myself mentally healthy.

import csv
import time
from datetime import datetime

import ctds

# Fill in your own connection settings for ctds.connect().
DBCONFIG = {"server": "myserver", "user": "myuser", "password": "mypassword", "database": "MYDB"}

# Assumed helpers, omitted in the original post: parse ints/dates, treating "null" as None.
def _to_int(txt: str):
    return None if txt == "null" else int(txt)

def _to_datetime(txt: str):
    return None if txt == "null" else datetime.fromisoformat(txt)

def _to_varchar(txt: str):
    """
    Wraps values bound for VARCHAR columns
    (on my server this needed ctds.SqlNVarChar; see the note above).
    """
    if txt == "null":
        return None
    return ctds.SqlNVarChar(txt)

def _to_nvarchar(txt: str):
    """
    Wraps values bound for NVARCHAR columns
    (on my server this needed ctds.SqlVarChar with utf-16le; see the note above).
    """
    if txt == "null":
        return None
    return ctds.SqlVarChar(txt.encode("utf-16le"))

def read(file):
    """
    Open CSV File. 
    Each line is a column:value dict.
    https://docs.python.org/3/library/csv.html?highlight=csv#csv.DictReader
    """
    with open(file, newline='') as csvfile:
        reader = csv.DictReader(csvfile)
        for row in reader:
            yield row

def transform(row):
    """
    Do transformations to data before loading.

    Data specified for bulk insertion into text columns (e.g. VARCHAR,
    NVARCHAR, TEXT) is not encoded on the client in any way by FreeTDS.
    Because of this behavior it is possible to insert textual data with
    an invalid encoding and cause the column data to become corrupted.

    To prevent this, it is recommended the caller explicitly wrap
    the object with either ctds.SqlVarChar (for CHAR, VARCHAR or TEXT
    columns) or ctds.SqlNVarChar (for NCHAR, NVARCHAR or NTEXT columns).
    For non-Unicode columns, the value should be first encoded to
    column’s encoding (e.g. latin-1). By default ctds.SqlVarChar will
    encode str objects to utf-8, which is likely incorrect for most SQL
    Server configurations.

    https://zillow.github.io/ctds/bulk_insert.html#text-columns
    """
    row["col1"] = _to_datetime(row["col1"])
    row["col2"] = _to_int(row["col2"])
    row["col3"] = _to_nvarchar(row["col3"])
    row["col4"] = _to_varchar(row["col4"])

    return row

def load(rows):
    stime = time.time()

    with ctds.connect(**DBCONFIG) as conn:
        with conn.cursor() as curs:
            curs.execute("TRUNCATE TABLE MYSCHEMA.MYTABLE")

        loaded_lines = conn.bulk_insert("MYSCHEMA.MYTABLE", map(transform, rows))

    etime = time.time()
    print(loaded_lines, " rows loaded in ", etime - stime)

if __name__ == "__main__":
    load(read('data.csv'))
kafran
If one gets it to work, it's pretty much the fastest option out there (at least for MSSQL), thanks for sharing. – Bennimi Oct 22 '21 at 08:50
does this work with Postgres? can it support Postgres' 'conflict' handler on bulk insert? – mike01010 May 07 '23 at 21:35
3

You should use executemany with cursor.fast_executemany = True to improve performance.

pyodbc's default behaviour is to run each insert separately, which is inefficient. By applying fast_executemany, you can drastically improve performance.

Here is an example:

import pyodbc

connection = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server}', host='host', database='db', user='usr', password='foo')
cursor = connection.cursor()

# I'm the important line
cursor.fast_executemany = True

sql = "insert into TableName (Col1, Col2, Col3) values (?, ?, ?)"
tuples=[('foo','bar', 'ham'), ('hoo','far', 'bam')]
cursor.executemany(sql, tuples)
cursor.commit()
cursor.close()
connection.close()

Docs. Note that fast_executemany has been available since pyodbc 4.0.19 (Oct 23, 2017).
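
Applied to the original question's copy between two datasources, a hedged sketch (the connection strings, query, and table are placeholders) would look like this:

srcConn = pyodbc.connect(srcConnStr)   # placeholder connection strings
dstConn = pyodbc.connect(dstConnStr)

# Materialize the source rows as plain tuples before handing them to executemany.
rows = [tuple(r) for r in srcConn.cursor().execute(selectSql).fetchall()]

dstCursor = dstConn.cursor()
dstCursor.fast_executemany = True  # requires pyodbc 4.0.19+
dstCursor.executemany("insert into TableName (Col1, Col2, Col3) values (?, ?, ?)", rows)
dstConn.commit()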

Preston
1

Helpful function for generating the SQL required for using executemany():

import pandas as pd

def generate_bulk_insert_sql(data: pd.DataFrame, table_name: str) -> str:
    columns = ", ".join(data.columns)
    placeholders = ", ".join("?" * len(data.columns))
    return f"INSERT INTO {table_name} ({columns}) VALUES ({placeholders})"
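
A hypothetical usage, pairing the generated statement with fast_executemany and the DataFrame's rows (df, cursor, and dbo.TargetTable are placeholders):

sql = generate_bulk_insert_sql(df, "dbo.TargetTable")
cursor.fast_executemany = True
cursor.executemany(sql, list(df.itertuples(index=False, name=None)))
cursor.commit()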