47

Below is my code that I'd like some help with. I have to run it over 1,300,000 rows, which means it takes up to 40 minutes to insert ~300,000 rows.

I figure bulk insert is the route to go to speed it up? Or is it slow because I'm iterating over the rows via the `for data in reader:` portion?

#Opens the prepped csv file
with open (os.path.join(newpath,outfile), 'r') as f:
    #hooks csv reader to file
    reader = csv.reader(f)
    #pulls out the columns (which match the SQL table)
    columns = next(reader)
    #trims any extra spaces
    columns = [x.strip(' ') for x in columns]
    #starts SQL statement
    query = 'bulk insert into SpikeData123({0}) values ({1})'
    #puts column names in SQL query 'query'
    query = query.format(','.join(columns), ','.join('?' * len(columns)))

    print 'Query is: %s' % query
    #starts curser from cnxn (which works)
    cursor = cnxn.cursor()
    #uploads everything by row
    for data in reader:
        cursor.execute(query, data)
        cursor.commit()

I am dynamically picking my column headers on purpose (as I would like to create the most pythonic code possible).

SpikeData123 is the table name.

Gord Thompson
TangoAlee
  • Once you know your code is working fine, remove the print; it should make it faster. – zulqarnain Apr 14 '15 at 22:19
  • This is not a complete answer, so I am leaving it as a comment. You might want to try turning off any indexes on the table while you load the data, then reindexing the table after the insert is complete, if you can do that. It might make quite a bit of difference, and it might not, but it won't take long to try. – Itsme2003 Aug 05 '20 at 04:53
  • I think the best way to Bulk Insert with Python is using the cTDS library as described in https://stackoverflow.com/a/64992905/3147247 – kafran Nov 24 '20 at 18:46

6 Answers

67

As noted in a comment to another answer, the T-SQL BULK INSERT command will only work if the file to be imported is on the same machine as the SQL Server instance or is in an SMB/CIFS network location that the SQL Server instance can read. Thus it may not be applicable in the case where the source file is on a remote client.

pyodbc 4.0.19 added a Cursor#fast_executemany feature which may be helpful in that case. fast_executemany is "off" by default, and the following test code ...

cnxn = pyodbc.connect(conn_str, autocommit=True)
crsr = cnxn.cursor()
crsr.execute("TRUNCATE TABLE fast_executemany_test")

sql = "INSERT INTO fast_executemany_test (txtcol) VALUES (?)"
params = [(f'txt{i:06d}',) for i in range(1000)]
t0 = time.time()
crsr.executemany(sql, params)
print(f'{time.time() - t0:.1f} seconds')

... took approximately 22 seconds to execute on my test machine. Simply adding crsr.fast_executemany = True ...

cnxn = pyodbc.connect(conn_str, autocommit=True)
crsr = cnxn.cursor()
crsr.execute("TRUNCATE TABLE fast_executemany_test")

crsr.fast_executemany = True  # new in pyodbc 4.0.19

sql = "INSERT INTO fast_executemany_test (txtcol) VALUES (?)"
params = [(f'txt{i:06d}',) for i in range(1000)]
t0 = time.time()
crsr.executemany(sql, params)
print(f'{time.time() - t0:.1f} seconds')

... reduced the execution time to just over 1 second.
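
Applied to the CSV scenario in the question, a minimal sketch might look like the following. It assumes the prepped CSV, `newpath`/`outfile`, and the `cnxn` connection from the question, plus pyodbc 4.0.19 or later; it is just an illustration of combining the question's approach with `fast_executemany`, not a definitive implementation.

import csv
import os

# minimal sketch: read the prepped CSV into a list of tuples, then send it to
# SQL Server in a single executemany() call with fast_executemany enabled
with open(os.path.join(newpath, outfile), 'r', newline='') as f:
    reader = csv.reader(f)
    columns = [x.strip() for x in next(reader)]   # header row matches the table
    params = [tuple(row) for row in reader]       # remaining rows as tuples

sql = 'INSERT INTO SpikeData123 ({0}) VALUES ({1})'.format(
    ','.join(columns), ','.join('?' * len(columns)))

crsr = cnxn.cursor()
crsr.fast_executemany = True  # requires pyodbc 4.0.19 or later
crsr.executemany(sql, params)
cnxn.commit()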

Gord Thompson
  • 6
    How would you insert from a `DataFrame` using this method? I tried `df.values.tolist()` as the `VALUES` section of the SQL query, but that didn't work. Also, where would the `.txt.` or `.csv` file actually go in your answer? – OverflowingTheGlass Dec 13 '17 at 22:52
  • 1
    @CameronTaylor **(1)** re: DataFrame - You may need to convert the values from `numpy` objects to native Python types as illustrated in [this answer](https://stackoverflow.com/a/46098694/2144390). **(2)** re: CSV file location - It would need to be someplace that your Python application can read. From there you would pull the information into memory, create a list of tuples, and then call `.executemany`. – Gord Thompson Dec 14 '17 at 14:51
  • 4
    @CameronTaylor - See [this answer](https://stackoverflow.com/a/48065082/2144390) for details on using `fast_executemany` with pandas (via SQLAlchemy). – Gord Thompson Feb 16 '18 at 13:55
  • 1
    Please be aware that there is an issue with pyodbc 4.0.24 for fast_executemany and columns that are varchar(max) which gives error: hy010. Please see https://github.com/mkleehammer/pyodbc/issues/371 which boils down to changing your varchar(max) to varchar(4000) fixed it for me. – Bryan Bailliache Oct 12 '18 at 14:58
  • Gord you lovely lovely man, thanks for this. I was testing out writing to an MS SQL Server dB on my local machine before testing on our prod dB. Took me 8 mins to write 4 columns of 25k rows (int only) to a dB on my machine using SQLAlchemy! – Umar.H Jan 22 '19 at 00:01
  • Really helpful!!! With fast_executemany, writing a data load process in Python became feasible. Otherwise, I had to switch to Java. – Rupen B Feb 26 '19 at 20:32
  • I needed to set `cnxn.autocommit = False` to see a performance improvement. My pyodbc version is 4.0.0. – mherzog Jul 16 '23 at 16:41
51

Update - May 2022: bcpandas and bcpyaz are wrappers for Microsoft's bcp utility.


Update - April 2019: As noted in the comment from @SimonLang, BULK INSERT under SQL Server 2017 and later apparently does support text qualifiers in CSV files (ref: here).


BULK INSERT will almost certainly be much faster than reading the source file row-by-row and doing a regular INSERT for each row. However, both BULK INSERT and BCP have a significant limitation regarding CSV files in that they cannot handle text qualifiers (ref: here). That is, if your CSV file does not have qualified text strings in it ...

1,Gord Thompson,2015-04-15
2,Bob Loblaw,2015-04-07

... then you can BULK INSERT it, but if it contains text qualifiers (because some text values contain commas) ...

1,"Thompson, Gord",2015-04-15
2,"Loblaw, Bob",2015-04-07

... then BULK INSERT cannot handle it. Still, it might be faster overall to pre-process such a CSV file into a pipe-delimited file ...

1|Thompson, Gord|2015-04-15
2|Loblaw, Bob|2015-04-07

... or a tab-delimited file (where → represents the tab character) ...

1→Thompson, Gord→2015-04-15
2→Loblaw, Bob→2015-04-07

... and then BULK INSERT that file. For the latter (tab-delimited) file the BULK INSERT code would look something like this:

import pypyodbc
conn_str = "DSN=myDb_SQLEXPRESS;"
cnxn = pypyodbc.connect(conn_str)
crsr = cnxn.cursor()
sql = """
BULK INSERT myDb.dbo.SpikeData123
FROM 'C:\\__tmp\\biTest.txt' WITH (
    FIELDTERMINATOR='\\t',
    ROWTERMINATOR='\\n'
    );
"""
crsr.execute(sql)
cnxn.commit()
crsr.close()
cnxn.close()
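
For the pre-processing step described above, a minimal sketch using only Python's csv module might look like this. The source path is a placeholder, and it assumes the field values themselves contain no tab characters:

import csv

# rough sketch: rewrite a quoted, comma-delimited CSV as a tab-delimited file
# that BULK INSERT can load; the source path is a placeholder
with open('C:\\__tmp\\source.csv', 'r', newline='') as fin, \
        open('C:\\__tmp\\biTest.txt', 'w', newline='') as fout:
    reader = csv.reader(fin)                   # strips the "..." text qualifiers
    writer = csv.writer(fout, delimiter='\t')
    for row in reader:
        writer.writerow(row)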

Note: As mentioned in a comment, executing a BULK INSERT statement is only applicable if the SQL Server instance can directly read the source file. For cases where the source file is on a remote client, see this answer.

Gord Thompson
  • 1
    Thank you Gord! I need some follow up help but I wanted to say thank you! – TangoAlee Apr 16 '15 at 14:30
  • 12
    I know this is an old post, but this solution only works if the file resides on the same server as SQL Server (or in a location that the SQL Server's service user is able to see). So if the file resides on my workstation and the SQL Server is elsewhere, then this solution will not work – Gabor Sep 20 '17 at 09:29
  • 1
    @Gabor - Good point. See [this answer](https://stackoverflow.com/a/47057189/2144390) for an alternative. – Gord Thompson Nov 01 '17 at 14:33
  • Nice. Do you know if it works the same manner with sqlalchemy as well? (as behind the scene it uses pyodbc, for me the answer would be yes, but you never know...:-) ) – Gabor Nov 01 '17 at 17:14
  • @Gabor - `fast_executemany` is a very recent addition to pyodbc and it is "off" by default (for compatibility with drivers that don't properly support the internal ODBC mechanisms that it uses) so I doubt that SQLAlchemy takes advantage of it yet. You may want to [ask them](https://www.sqlalchemy.org/support.html) about it. – Gord Thompson Nov 01 '17 at 17:38
  • @Gabor - See [this answer](https://stackoverflow.com/a/48065082/2144390) for a way to leverage `fast_executemany` with SQLAlchemy. – Gord Thompson Jan 19 '18 at 17:06
  • 1
    Since SQL Server 2017, `BULK INSERT` supports valid CSV's according to RFC 4180. See https://learn.microsoft.com/en-us/sql/t-sql/statements/bulk-insert-transact-sql – Simon Lang Apr 11 '18 at 07:07
  • @Gord Thompson. I was able to load a csv to SQL Server using Python and it loaded very fast. However, the process locked the table and I was only able to execute queries using with(nolock), and when I did sp_who2 I found a few transactions in sleeping mode with programName as python. I had to kill that session to release the lock. After that I saw zero records in the SQL Server table – vikrant rana Jul 26 '19 at 10:41
1

Yes, bulk insert is the right path for loading large files into a DB. At a glance I would say the reason it takes so long is, as you mentioned, that you are looping over each row of data from the file, which effectively removes the benefit of using a bulk insert and makes it like a normal insert. Just remember that, as its name implies, it is used to insert chunks of data. I would remove the loop and try again.

Also, I'd double-check your syntax for the bulk insert, as it doesn't look correct to me. Check the SQL that is generated by pyodbc, as I have a feeling it might only be executing a normal insert.

Alternatively, if it is still slow, I would try using bulk insert directly from SQL: either load the whole file into a temp table with bulk insert and then insert the relevant columns into the right tables, or use a mix of bulk insert and bcp to get the specific columns inserted, or use OPENROWSET.
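
As an illustration of the temp/staging-table idea, a rough sketch might look like the following; the staging table, column names, DSN, and file path are all hypothetical, and it assumes the staging table already exists with the same layout as the file:

import pyodbc

# rough sketch: BULK INSERT the whole file into a staging table, then copy
# only the relevant columns into the real table (all names are hypothetical)
cnxn = pyodbc.connect("DSN=myDb_SQLEXPRESS;", autocommit=True)
crsr = cnxn.cursor()
crsr.execute("""
    BULK INSERT dbo.SpikeData123_staging
    FROM 'C:\\__tmp\\biTest.txt'
    WITH (FIELDTERMINATOR = '\\t', ROWTERMINATOR = '\\n');
""")
crsr.execute("""
    INSERT INTO dbo.SpikeData123 (col1, col2)
    SELECT col1, col2
    FROM dbo.SpikeData123_staging;
""")
crsr.close()
cnxn.close()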

Michael Moura
1

This problem was frustrating me, and I didn't see much improvement using fast_executemany until I found this post on SO, specifically Bryan Bailliache's comment regarding max varchar. I had been using SQLAlchemy, and even ensuring better datatype parameters did not fix the issue for me; however, switching to pyodbc did. I also took Michael Moura's advice of using a temp table and found it shaved off even more time. I wrote a function in case anyone might find it useful. It takes either a single list or a list of lists for the insert. It took my insert of the same data from sometimes upwards of 40 minutes (using SQLAlchemy and Pandas to_sql) down to just under 4 seconds. I may have been misusing my former method, though.

Connection

import os
import time

import pyodbc

def mssql_conn():
    conn = pyodbc.connect(driver='{ODBC Driver 17 for SQL Server}',
                          server=os.environ.get('MS_SQL_SERVER'),
                          database='EHT',
                          uid=os.environ.get('MS_SQL_UN'),
                          pwd=os.environ.get('MS_SQL_PW'),
                          autocommit=True)
    return conn

Insert function

def mssql_insert(table,val_lst,truncate=False,temp_table=False):
    '''Use as direct connection to database to insert data, especially for
       large inserts. Takes either a single list (for one row),
       or list of list (for multiple rows). Can either append to table
       (default) or if truncate=True, replace existing.'''
    conn = mssql_conn()
    cursor = conn.cursor()
    cursor.fast_executemany = True
    tt = False
    qm = '?,'
    if isinstance(val_lst[0],list):
        rows = len(val_lst)
        params = qm * len(val_lst[0])
    else:
        rows = 1
        params = qm * len(val_lst)
        val_lst = [val_lst]
    params = params[:-1]
    if truncate:
        cursor.execute(f"TRUNCATE TABLE {table}")
    if temp_table:
        #create a temp table with same schema
        start_time = time.time()
        cursor.execute(f"SELECT * INTO ##{table} FROM {table} WHERE 1=0")
        table = f"##{table}"
        #set flag to indicate temp table was used
        tt = True
    else:
        start_time = time.time()
    #insert into either existing table or newly created temp table
    stmt = f"INSERT INTO {table} VALUES ({params})"
    cursor.executemany(stmt,val_lst)
    if tt:
        #remove temp moniker and insert from temp table
        dest_table = table[2:]
        cursor.execute(f"INSERT INTO {dest_table} SELECT * FROM {table}")
        print('Temp table used!')
        print(f'{rows} rows inserted into the {dest_table} table in '
              f'{time.time() - start_time} seconds')
    else:
        print('No temp table used!')
        print(f'{rows} rows inserted into the {table} table in '
              f'{time.time() - start_time} seconds')
    cursor.close()
    conn.close()
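
A hypothetical call, assuming a destination table whose columns match the row layout, might look like:

#hypothetical usage: stage two rows through a global temp table and replace
#the existing contents of the destination table
rows = [[1, 'Thompson, Gord', '2015-04-15'],
        [2, 'Loblaw, Bob', '2015-04-07']]
mssql_insert('SpikeData123', rows, truncate=True, temp_table=True)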

And my console results, first without using a temp table and then using one (in both cases, the table contained data at the time of execution and truncate=True):

No temp table used!
18204 rows inserted into the CUCMDeviceScrape_WithForwards table in 10.595500707626343 seconds

Temp table used!
18204 rows inserted into the CUCMDeviceScrape_WithForwards table in 3.810380458831787 seconds
1

FWIW, I gave a few methods of inserting to SQL Server some testing of my own. I was actually able to get the fastest results by using SQL Server batches and pyodbc Cursor.execute statements. I did not test saving to CSV and BULK INSERT; I wonder how that compares.

Here's my blog on the testing I did: http://jonmorisissqlblog.blogspot.com/2021/05/python-pyodbc-and-batch-inserts-to-sql.html
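
A rough sketch of the batched-INSERT idea described there (concatenating INSERT statements and sending each batch in a single Cursor.execute call) might look like the following; the table and column names are hypothetical, and values are inlined as literals purely for illustration:

BATCH_SIZE = 1000  # INSERT statements sent per Cursor.execute() call

def insert_in_batches(cnxn, rows):
    # `cnxn` is an open pyodbc connection; `rows` yields (id, name) pairs.
    # Table and column names below are hypothetical.
    crsr = cnxn.cursor()
    batch = []
    for id_val, name_val in rows:
        safe_name = str(name_val).replace("'", "''")   # naive literal escaping
        batch.append(f"INSERT INTO dbo.SpikeData123 (id, name) "
                     f"VALUES ({int(id_val)}, '{safe_name}')")
        if len(batch) == BATCH_SIZE:
            crsr.execute(";".join(batch))   # one round trip per batch
            batch = []
    if batch:
        crsr.execute(";".join(batch))
    cnxn.commit()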

Jon Morisi
0

Adding to Gord Thompson's answer:

# add the below line for controlling batch size of insert
cursor.fast_executemany_rows = batch_size # by default it is 1000
Eric Aya