0

I can't seem to figure out what I'm doing wrong when writing rows to SQL Server using Python 3.11.4, pyodbc 4.0.39, and SQLAlchemy 2.0.20. The problem is that about 100k rows take nearly a day and a half to insert into a table. The table has no indexes and is empty. I'm hoping the snippet of code below provides some insight.

import sqlalchemy as sa

# df is a pandas DataFrame with ~100k rows to load
cn_1 = sa.create_engine('mssql+pyodbc://servername/dbname?driver=ODBC+Driver+17+for+SQL+Server', fast_executemany=True)

df.to_sql(
    "mytable",
    cn_1,
    if_exists="replace",
    index=False,
)
Dale K

2 Answers

0

The to_sql function doesn't perform bulk insertion by default. To improve performance, you can try using the method='multi' parameter in the to_sql call, which uses the executemany method for bulk insertion.

df.to_sql(
    "mytable",
    cn_1,
    if_exists="replace",
    index=False,
    method='multi'
)

Other possible reasons for slow performance include network latency, server performance, and driver configuration.
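
One caveat if you do try method='multi' against SQL Server: each chunk is sent as a single parameterized INSERT, and SQL Server caps a statement at 2100 parameters, so chunksize has to be small enough that rows-per-chunk times column count stays under that limit. A minimal sketch (the chunksize value below is illustrative, not a tuned recommendation):

df.to_sql(
    "mytable",
    cn_1,
    if_exists="replace",
    index=False,
    method='multi',
    chunksize=50,  # keep (rows per chunk) * (number of columns) below 2100
)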

Sauron
  • You are mistaken. `.to_sql(method=None)` (the default) results in `.executemany()` calls at the DBAPI (e.g., pyodbc) layer. `.to_sql(method="multi")` constructs a multi-row INSERT which is passed to `.execute()`, not `.executemany()`. For pyodbc, `.executemany()` with `fast_executemany=True` is almost always faster than `.to_sql(method="multi")`. Details [here](https://stackoverflow.com/a/50706219/2144390). – Gord Thompson Aug 29 '23 at 20:24
  • Thanks to all who have answered. Gord, I have been reading through Stack Overflow and appreciate your knowledge and insight, and I wish I could send you a question directly, but I guess that's not how Stack Overflow works. So I will try to answer the questions you've asked. Yes, there are likely null values in the first row. Is there anything that should be done as a result? So what I am gathering is that using multi is not best for SQL Server... but should I use chunksize to keep it fast? – JeffVanderDoes Aug 30 '23 at 13:03
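
For reference, a minimal sketch of the approach described in the first comment above: leave method at its default (which pyodbc services via .executemany()) and enable fast_executemany=True on the engine. The chunksize value is illustrative and optional, and the DataFrame here is placeholder data:

import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine(
    "mssql+pyodbc://servername/dbname?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,  # pyodbc sends parameter arrays in bulk
)

df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})  # placeholder data

df.to_sql(
    "mytable",
    engine,
    if_exists="replace",
    index=False,
    method=None,     # default: rows are passed to the DBAPI .executemany()
    chunksize=1000,  # optional; caps the rows sent per .executemany() batch
)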
0

Your approach is correct for most cases. However, there is a known issue with fast_executemany=True when working with a DataFrame that contains a lot of None or NaN values:

https://github.com/mkleehammer/pyodbc/issues/741

One possible workaround would be to use the mssql_insert_json function as the method= for .to_sql(), e.g.,

from mssql_insert_json import mssql_insert_json

# …

cn_1 = sa.create_engine(
    'mssql+pyodbc://servername/dbname?driver=ODBC+Driver+17+for+SQL+Server',
    # `fast_executemany=True` is not required
)

df.to_sql(
    "mytable",
    cn_1,
    if_exists="replace",
    index=False,
    method=mssql_insert_json
)
Gord Thompson
  • I think you might have just pushed against my ability boundaries. I read the link above but got lost in transitioning to JSON. The above snippet is clear, but do I need to do anything to convert the data to JSON? Thanks! – JeffVanderDoes Aug 30 '23 at 18:01
  • The `mssql_insert_json` function takes care of that for you. pandas provides the column names as `keys` and the values as `data_iter`. The function creates a list of dict (JSON) with `json_data = [dict(zip(keys, row)) for row in data_iter]` and then converts it to a string with `json.dumps(json_data, default=str)` – Gord Thompson Aug 30 '23 at 18:21
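
For readers who can't open the link, here is a rough, untested sketch of what such a method= callable could look like, pieced together from the description in the comment above. The _sketch name is mine, the all-NVARCHAR(MAX) column mapping is a simplification, and OPENJSON requires SQL Server 2016+ (compatibility level 130 or higher); the actual mssql_insert_json in the linked issue may differ:

import json
import sqlalchemy as sa

def mssql_insert_json_sketch(table, conn, keys, data_iter):
    # table: pandas SQLTable, conn: SQLAlchemy Connection,
    # keys: column names, data_iter: iterable of row tuples
    json_data = [dict(zip(keys, row)) for row in data_iter]  # one dict per row
    json_string = json.dumps(json_data, default=str)         # single JSON document

    table_name = f"[{table.schema}].[{table.name}]" if table.schema else f"[{table.name}]"
    columns = ", ".join(f"[{k}]" for k in keys)
    # read every JSON property back as NVARCHAR(MAX); SQL Server converts on insert
    with_clause = ", ".join(f"[{k}] NVARCHAR(MAX) '$.\"{k}\"'" for k in keys)

    sql = (
        f"INSERT INTO {table_name} ({columns}) "
        f"SELECT {columns} FROM OPENJSON(:json_doc) WITH ({with_clause})"
    )
    conn.execute(sa.text(sql), {"json_doc": json_string})

It would be passed to .to_sql() as method=mssql_insert_json_sketch, the same way the answer above passes method=mssql_insert_json.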