0

I can't seem to figure out what I'm doing wrong when writing rows to SQL Server using Python 3.11.4, pyodbc 4.0.39, and SQLAlchemy 2.0.20. The problem is that about 100k rows take nearly a day and a half to insert into a table. The table has no indexes and is empty. I'm hoping the snippet of code below provides some insight.

import sqlalchemy as sa

# df is a pandas DataFrame with ~100k rows to load
cn_1 = sa.create_engine('mssql+pyodbc://servername/dbname?driver=ODBC+Driver+17+for+SQL+Server', fast_executemany=True)

df.to_sql(
    "mytable",
    cn_1,
    if_exists="replace",
    index=False,
)
Dale K

2 Answers

0

The to_sql function doesn't perform bulk insertion by default. To improve performance, you can try using the method='multi' parameter in the to_sql call, which uses the executemany method for bulk insertion.

df.to_sql(
    "mytable",
    cn_1,
    if_exists="replace",
    index=False,
    method='multi'
)

Other possible reasons for slow performance include network latency, server performance, and driver configuration.
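
One caveat if you do try method='multi' against SQL Server: each chunk is sent as a single parameterized INSERT, and SQL Server caps a statement at 2100 parameters, so chunksize has to be small enough that rows-per-chunk times column count stays under that limit. A minimal sketch (the chunksize value below is illustrative, not a tuned recommendation):

df.to_sql(
    "mytable",
    cn_1,
    if_exists="replace",
    index=False,
    method='multi',
    chunksize=50,  # keep (rows per chunk) * (number of columns) below 2100
)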

Sauron
  • You are mistaken. `.to_sql(method=None)` (the default) results in `.executemany()` calls at the DBAPI (e.g., pyodbc) layer. `.to_sql(method="multi")` constructs a multi-row INSERT which is passed to `.execute()`, not `.executemany()`. For pyodbc, `.executemany()` with `fast_executemany=True` is almost always faster than `.to_sql(method="multi")`. Details [here](https://stackoverflow.com/a/50706219/2144390). – Gord Thompson Aug 29 '23 at 20:24
  • Thanks to all who have answered. Gord, I have been reading through Stack Overflow and appreciate your knowledge and insight, and I wish I could send you a question directly, but I guess that's not how Stack Overflow works. So I will try to answer the questions you've asked. Yes, there are likely null values in the first row. Is there anything that should be done as a result? So what I am gathering is that using multi is not best for SQL Server... but should I use chunksize to keep it fast? – JeffVanderDoes Aug 30 '23 at 13:03
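
For reference, a minimal sketch of the approach described in the first comment above: leave method at its default (which pyodbc services via .executemany()) and enable fast_executemany=True on the engine. The chunksize value is illustrative and optional, and the DataFrame here is placeholder data:

import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine(
    "mssql+pyodbc://servername/dbname?driver=ODBC+Driver+17+for+SQL+Server",
    fast_executemany=True,  # pyodbc sends parameter arrays in bulk
)

df = pd.DataFrame({"col1": [1, 2, 3], "col2": ["a", "b", "c"]})  # placeholder data

df.to_sql(
    "mytable",
    engine,
    if_exists="replace",
    index=False,
    method=None,     # default: rows are passed to the DBAPI .executemany()
    chunksize=1000,  # optional; caps the rows sent per .executemany() batch
)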
0

Your approach is correct for most cases. However, there is a known issue with fast_executemany=True when working with a DataFrame that contains a lot of None or NaN values:

https://github.com/mkleehammer/pyodbc/issues/741

One possible workaround would be to use the mssql_insert_json function as the method= for .to_sql(), e.g.,

from mssql_insert_json import mssql_insert_json

# …

cn_1 = sa.create_engine(
    'mssql+pyodbc://servername/dbname?driver=ODBC+Driver+17+for+SQL+Server',
    # `fast_executemany=True` is not required
)

df.to_sql(
    "mytable",
    cn_1,
    if_exists="replace",
    index=False,
    method=mssql_insert_json
)
Gord Thompson
  • I think you might have just pushed against my ability boundaries. I read the link above but got lost in transitioning to JSON. The above snippet is clear, but do I need to do anything to convert the data to JSON? Thanks! – JeffVanderDoes Aug 30 '23 at 18:01
  • The `mssql_insert_json` function takes care of that for you. pandas provides the column names as `keys` and the values as `data_iter`. The function creates a list of dict (JSON) with `json_data = [dict(zip(keys, row)) for row in data_iter]` and then converts it to a string with `json.dumps(json_data, default=str)` – Gord Thompson Aug 30 '23 at 18:21
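
For readers who can't open the link, here is a rough, untested sketch of what such a method= callable could look like, pieced together from the description in the comment above. The _sketch name is mine, the all-NVARCHAR(MAX) column mapping is a simplification, and OPENJSON requires SQL Server 2016+ (compatibility level 130 or higher); the actual mssql_insert_json in the linked issue may differ:

import json
import sqlalchemy as sa

def mssql_insert_json_sketch(table, conn, keys, data_iter):
    # table: pandas SQLTable, conn: SQLAlchemy Connection,
    # keys: column names, data_iter: iterable of row tuples
    json_data = [dict(zip(keys, row)) for row in data_iter]  # one dict per row
    json_string = json.dumps(json_data, default=str)         # single JSON document

    table_name = f"[{table.schema}].[{table.name}]" if table.schema else f"[{table.name}]"
    columns = ", ".join(f"[{k}]" for k in keys)
    # read every JSON property back as NVARCHAR(MAX); SQL Server converts on insert
    with_clause = ", ".join(f"[{k}] NVARCHAR(MAX) '$.\"{k}\"'" for k in keys)

    sql = (
        f"INSERT INTO {table_name} ({columns}) "
        f"SELECT {columns} FROM OPENJSON(:json_doc) WITH ({with_clause})"
    )
    conn.execute(sa.text(sql), {"json_doc": json_string})

It would be passed to .to_sql() as method=mssql_insert_json_sketch, the same way the answer above passes method=mssql_insert_json.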