
I need to insert a big (200k row) data frame into an MS SQL table. When I do a line-by-line insert, it takes a very long time. I have tried the following:

import pandas as pd
import pyodbc
import numpy as np
from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://server1/<database>"
    "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes"
)

df.to_sql('<db_table_name>', engine, if_exists='append')

Is there an option to commit and close the connection?

It seems that df.to_sql executes and does not raise any errors.

I also tried setting the chunksize argument, and the result was the same: no errors, but no rows inserted.

  • try passing `chunksize`, see if it helps, e.g. start with 5000 – eshirvana Aug 04 '23 at 15:13
  • Please [edit] your post with specific timings and errors. We cannot interpret *taking a very long time* or *it is not working*. You also have an *engine* typo and appear to be passing an SQLAlchemy connection string rather than an engine object, which may be the reason for your error. – Parfait Aug 06 '23 at 17:27
  • @Parfait "appear to be passing an sqlalchemy connection string but not object" – can you please clarify this? I fixed the typo, and it is still not inserting data into the MS SQL table. – user1471980 Aug 07 '23 at 04:53

5 Answers


For me, the issue appeared because SQL Server stores Python str columns as TEXT.

So I changed this behaviour:

  1. Measure the length of the string columns in the df:

     import numpy as np
     import sqlalchemy as sqla

     # vectorized string-length helper used below
     measurer = np.vectorize(len)

     cols = df.dtypes[df.dtypes == 'object'].index
     dic_str_max_len = {}
     for col in cols:
         res = measurer(df[col].values.astype(str)).max(axis=0)
         dic_str_max_len[col] = res
    
  2. Specify the type mapping that we will pass to to_sql:

    type_mapping = {col : sqla.types.String(dic_str_max_len[col]) for col in cols}
    
  3. Finally, we pass the mapping to to_sql:

     df.to_sql(name, engine, if_exists='replace', dtype=type_mapping, index=False)
    

This makes the load significantly faster.
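
If you want to confirm that the columns were indeed created as VARCHAR(n) rather than TEXT, here is a minimal sketch using SQLAlchemy's inspector (the table name is a placeholder):

from sqlalchemy import inspect

insp = inspect(engine)
for col in insp.get_columns('<db_table_name>'):
    print(col['name'], col['type'])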


There are different ways you could approach this:

Use Batching (Chunking)

Try experimenting with different chunk sizes to find the optimal balance between performance and memory usage.

chunksize = 1000  # Adjust this value as needed
df.to_sql('<db_table_name>', engine, if_exists='append', chunksize=chunksize)

Bulk Insertion:

You can use the to_sql() method with the method='multi' parameter. With mssql+pyodbc this builds one multi-row INSERT per chunk, so keep the chunk small enough that rows × columns stays below SQL Server's limit of roughly 2100 parameters per statement.

df.to_sql('<db_table_name>', engine, if_exists='append', method='multi')

Use SQLAlchemy Core: Expanding on your solution

from sqlalchemy import create_engine, MetaData, Table

engine = create_engine(
    "mssql+pyodbc://server1/<database>"
    "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes"
)

# Reflect the existing target table
metadata = MetaData()
table = Table('<db_table_name>', metadata, autoload_with=engine)

# 'data' is a list of dictionaries whose keys match the column names
data = df.to_dict(orient='records')

# engine.begin() opens a connection, commits on success, rolls back on error,
# and closes the connection when the block exits
with engine.begin() as conn:
    conn.execute(table.insert(), data)

– Mirwise Khan

Try this,

import pyodbc

conn = pyodbc.connect('Driver={SQL Server};'
                      'Server=yourserver;'
                      'Database=your_database;'
                      'Trusted_Connection=yes;')

cursor = conn.cursor()

cursor.execute('''
        INSERT INTO table WITH (TABLOCK) (col1, col2, col3)
        VALUES
            ('col1', 'col2', 'col3')
''')
conn.commit()

Optimize the connection string in Python, e.g. with options like maxconcurrent. Notice the WITH (TABLOCK) hint on my insert. Try optimizing your code.
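
For 200k rows, inserting literal values one statement at a time will still be slow. A parameterized executemany with pyodbc's fast_executemany flag is the usual way to speed this up; a minimal sketch, with the table and column names as placeholders:

import pyodbc

conn = pyodbc.connect('Driver={ODBC Driver 17 for SQL Server};'
                      'Server=yourserver;'
                      'Database=your_database;'
                      'Trusted_Connection=yes;')
cursor = conn.cursor()
cursor.fast_executemany = True  # send rows in large batches instead of one round trip each

# Rows as a list of tuples, in the same order as the column list below
rows = list(df.itertuples(index=False, name=None))
cursor.executemany(
    "INSERT INTO your_table (col1, col2, col3) VALUES (?, ?, ?)",
    rows,
)
conn.commit()
cursor.close()
conn.close()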

– KumarHarsh

Not sure if this question is related to Azure Databricks, but since I am using Azure Databricks and faced the same issue, I am sharing my answer. Whether pyodbc or JDBC, in my case all of these libraries were slow.

I came across "sql-spark-connector", which is about 15 times faster than the generic JDBC connector for writing data to SQL Server.

I am using this connector to load around 140 tables, more than 10 of which have 5 million+ rows. All of these tables are imported in less than 30 minutes (since I have over 100 tables, I am multithreading; see the sketch after the write example below). So importing a single big table should take only minutes.

Here is the code to write the dataframe to SQL.

jdbcUrl = "jdbc:sqlserver://yourSQLServer:1433;database=yourDatabase"
user = "sqlusername"
password = "sqluserpassword"
tName ="yourTableName"

(
    df.write.format("com.microsoft.sqlserver.jdbc.spark")
    .mode("overwrite")
    .option("reliabilityLevel", "BEST_EFFORT")
    .option("tableLock", "false")
    .option("schemaCheckEnabled", "false")
    .option("truncate", "true")
    .option("url", jdbcUrl)
    .option("dbtable", tName)
    .option("nullValue", "")
    .option("batchsize", 1048576)
    .option("user", user)
    .option("password", password)
    .save()
)
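
The multithreading mentioned above just means running several of these writes concurrently. A minimal sketch using a thread pool, assuming the Databricks spark session and the jdbcUrl/user/password variables above are in scope (the write_table helper and the table list are hypothetical):

from concurrent.futures import ThreadPoolExecutor

def write_table(tName):
    # Hypothetical helper: read one source table and write it with the same options as above
    (
        spark.table(tName).write.format("com.microsoft.sqlserver.jdbc.spark")
        .mode("overwrite")
        .option("url", jdbcUrl)
        .option("dbtable", tName)
        .option("user", user)
        .option("password", password)
        .save()
    )

table_names = ["table1", "table2"]  # placeholder list of source table names
with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(write_table, table_names))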

You can also refer to the sql-spark-connector page for more samples and details: sql-spark-connector

Hope it helps!

– rs1
  1. Batch Insertion: Instead of inserting rows one by one, batch the insertions. Divide your DataFrame into smaller chunks (e.g., 1000 rows per batch) and insert them in batches. This can significantly reduce the overhead of individual transactions.

  2. Use the fast_executemany Parameter: Create the engine with fast_executemany=True. With mssql+pyodbc, this makes the default insert path use pyodbc's bulk executemany, which is much more efficient. (Avoid combining it with method='multi', which builds one large multi-row INSERT instead of using executemany.)

Code:

from sqlalchemy import create_engine

engine = create_engine(
    "mssql+pyodbc://server1/<database>"
    "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes",
    fast_executemany=True,
)

chunksize = 1000
for i in range(0, len(df), chunksize):
    df_chunk = df[i:i + chunksize]
    df_chunk.to_sql('<db_table_name>', engine, if_exists='append')
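
Equivalently, with the same placeholder names, you can let to_sql do the chunking itself instead of slicing the DataFrame manually:

df.to_sql('<db_table_name>', engine, if_exists='append', chunksize=1000)
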
– Navkar Jain