
I need to insert a big (200k row) data frame into an MS SQL table. When I do a line-by-line insert, it takes a very long time. I have tried the following:

import pandas as pd
import pyodbc
import numpy as np
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://server1/<database>"
    "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes"
)

df.to_sql('<db_table_name>', engine, if_exists='append')

Is there an option to commit and close the connection?

It seems that df.to_sql executes and does not raise any errors.

I also tried setting the chunksize argument, and the result was the same: no errors, but no rows inserted.

  • try passing `chunksize`, see if it helps, e.g. start with 5000 – eshirvana Aug 04 '23 at 15:13
  • Please [edit] your post with specific timings and errors. We cannot interpret *taking a very long time* or *it is not working*. You also have an *engine* typo and appear to be passing an SQLAlchemy connection string rather than an engine object, which may be the reason for your error. – Parfait Aug 06 '23 at 17:27
  • @Parfait "appear to be passing an sqlalchemy connection string but not object" – can you please clarify this? I fixed the typo, and it is still not inserting data into the MS SQL table. – user1471980 Aug 07 '23 at 04:53

5 Answers


For me, the issue appeared because SQL Server stores Python str columns as TEXT.

So I changed this behaviour:

  1. Measure the length of the string columns in the df:

     import numpy as np
     import sqlalchemy as sqla

     # vectorized string-length helper used below
     measurer = np.vectorize(len)

     cols = df.dtypes[df.dtypes == 'object'].index
     dic_str_max_len = {}
     for col in cols:
         res = measurer(df[col].values.astype(str)).max(axis=0)
         dic_str_max_len[col] = res
    
  2. Specify the type mapping that we will pass to to_sql:

    type_mapping = {col : sqla.types.String(dic_str_max_len[col]) for col in cols}
    
  3. Finally, we pass the mapping to to_sql:

     df.to_sql(name, engine, if_exists='replace', dtype=type_mapping, index=False)
    

This makes the load significantly faster.
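
If you want to confirm that the columns were indeed created as VARCHAR(n) rather than TEXT, here is a minimal sketch using SQLAlchemy's inspector (the table name is a placeholder):

from sqlalchemy import inspect

insp = inspect(engine)
for col in insp.get_columns('<db_table_name>'):
    print(col['name'], col['type'])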


There are different ways you could approach this:

Use Batching (Chunking)

Try experimenting with different chunk sizes to find the optimal balance between performance and memory usage.

chunksize = 1000  # Adjust this value as needed
df.to_sql('<db_table_name>', engine, if_exists='append', chunksize=chunksize)

Bulk Insertion:

You can use the to_sql() method with the method='multi' parameter. With mssql+pyodbc this builds one multi-row INSERT per chunk, so keep the chunk small enough that rows × columns stays below SQL Server's limit of roughly 2100 parameters per statement.

df.to_sql('<db_table_name>', engine, if_exists='append', method='multi')

Use SQLAlchemy Core: Expanding on your solution

from sqlalchemy import create_engine, MetaData, Table

engine = create_engine(
    "mssql+pyodbc://server1/<database>"
    "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes"
)

# Reflect the existing target table
metadata = MetaData()
table = Table('<db_table_name>', metadata, autoload_with=engine)

# 'data' is a list of dictionaries whose keys match the column names
data = df.to_dict(orient='records')

# engine.begin() opens a connection, commits on success, rolls back on error,
# and closes the connection when the block exits
with engine.begin() as conn:
    conn.execute(table.insert(), data)

– Mirwise Khan

Try this,

import pyodbc

conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=yourserver;'
                      'Database=your_database;'
                      'Trusted_Connection=yes;')

cursor = conn.cursor()

cursor.execute('''
        INSERT INTO table WITH (TABLOCK) (col1, col2, col3)
        VALUES
            ('col1', 'col2', 'col3')
''')
conn.commit()

Optimize the connection string in Python, e.g. with options like maxconcurrent. Notice the WITH (TABLOCK) hint on my insert. Try optimizing your code.
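
For 200k rows, inserting literal values one statement at a time will still be slow. A parameterized executemany with pyodbc's fast_executemany flag is the usual way to speed this up; a minimal sketch, with the table and column names as placeholders:

import pyodbc

conn = pyodbc.connect('Driver={ODBC Driver 17 for SQL Server};'
                      'Server=yourserver;'
                      'Database=your_database;'
                      'Trusted_Connection=yes;')
cursor = conn.cursor()
cursor.fast_executemany = True  # send rows in large batches instead of one round trip each

# Rows as a list of tuples, in the same order as the column list below
rows = list(df.itertuples(index=False, name=None))
cursor.executemany(
    "INSERT INTO your_table (col1, col2, col3) VALUES (?, ?, ?)",
    rows,
)
conn.commit()
cursor.close()
conn.close()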

– KumarHarsh

Not sure if this question is related to Azure Databricks, but since I am using Azure Databricks and faced the same issue, I am sharing my answer. Whether pyodbc or JDBC, in my case all of these libraries were slow.

I came across "sql-spark-connector", which is about 15 times faster than the generic JDBC connector for writing data to SQL Server.

I am using this connector to load around 140 tables, more than 10 of which have 5 million+ rows. All of these tables are imported in less than 30 minutes (since I have over 100 tables, I am multithreading; see the sketch after the write example below). So importing a single big table should take only minutes.

Here is the code to write the dataframe to SQL.

jdbcUrl = "jdbc:sqlserver://yourSQLServer:1433;database=yourDatabase"
user = "sqlusername"
password = "sqluserpassword"
tName ="yourTableName"

(
    df.write.format("com.microsoft.sqlserver.jdbc.spark")
    .mode("overwrite")
    .option("reliabilityLevel", "BEST_EFFORT")
    .option("tableLock", "false")
    .option("schemaCheckEnabled", "false")
    .option("truncate", "true")
    .option("url", jdbcUrl)
    .option("dbtable", tName)
    .option("nullValue", "")
    .option("batchsize", 1048576)
    .option("user", user)
    .option("password", password)
    .save()
)
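
The multithreading mentioned above just means running several of these writes concurrently. A minimal sketch using a thread pool, assuming the Databricks spark session and the jdbcUrl/user/password variables above are in scope (the write_table helper and the table list are hypothetical):

from concurrent.futures import ThreadPoolExecutor

def write_table(tName):
    # Hypothetical helper: read one source table and write it with the same options as above
    (
        spark.table(tName).write.format("com.microsoft.sqlserver.jdbc.spark")
        .mode("overwrite")
        .option("url", jdbcUrl)
        .option("dbtable", tName)
        .option("user", user)
        .option("password", password)
        .save()
    )

table_names = ["table1", "table2"]  # placeholder list of source table names
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(write_table, table_names))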

You can also refer to the sql-spark-connector page for more samples and details: sql-spark-connector

Hope it helps!

– rs1
  1. Batch Insertion: Instead of inserting rows one by one, batch the insertions. Divide your DataFrame into smaller chunks (e.g., 1000 rows per batch) and insert them in batches. This can significantly reduce the overhead of individual transactions.

  2. Use the fast_executemany Parameter: Create the engine with fast_executemany=True. With mssql+pyodbc, this makes the default insert path use pyodbc's bulk executemany, which is much more efficient. (Avoid combining it with method='multi', which builds one large multi-row INSERT instead of using executemany.)

Code:

from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://server1/<database>"
    "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes",
    fast_executemany=True,
)

chunksize = 1000
for i in range(0, len(df), chunksize):
    df_chunk = df[i:i + chunksize]
    df_chunk.to_sql('<db_table_name>', engine, if_exists='append')
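
Equivalently, with the same placeholder names, you can let to_sql do the chunking itself instead of slicing the DataFrame manually:

df.to_sql('<db_table_name>', engine, if_exists='append', chunksize=1000)
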
– Navkar Jain