
I am downloading JSON data from an API and using SQLAlchemy, pyodbc, and pandas' to_sql function to insert that data into an MSSQL server.

I can download up to 10000 rows; however, I have to limit the chunksize to 10, otherwise I get the following error:

DBAPIError: (pyodbc.Error) ('07002', '[07002] [Microsoft][SQL Server Native Client 11.0]COUNT field incorrect or syntax error (0) (SQLExecDirectW)') [SQL: 'INSERT INTO [TEMP_producing_entity_details]

There are around 500 million rows to download, and at this speed it is just crawling. Any advice on a workaround?

Thanks,

dfundako
Ryan
  • What is different between the logic for 10 which works and 11 which doesn't work? – dfundako Jun 04 '18 at 21:35
  • 10 or 100 is the number of rows it inserts into the database at a given time. – Ryan Jun 05 '18 at 02:19
  • Can you try running SQL Profiler to see what the server is trying to process? If pandas is attempting to call a system stored procedure with more than 2100 parameters then that might be causing the error. – Gord Thompson Jun 05 '18 at 14:25
  • That appears to be the problem. It's trying to insert 2400 parameters. Is there a way around this? – Ryan Jun 05 '18 at 15:38
  • @GordThompson That appears to be the problem. It's trying to insert 2400 parameters. Is there a way around this? Is there a more efficient way to insert the pandas DataFrame into the SQL table? It was working before, I thought, then I upgraded the pyodbc package and it stopped. Thanks – Ryan Jun 05 '18 at 15:44
  • It's not a change in pyodbc that caused the problem, it's a change in pandas. – Gord Thompson Jun 05 '18 at 17:43
  • @GordThompson so which version of pandas would I roll back to if I wanted to avoid this problem? – Ryan Jun 05 '18 at 18:19

4 Answers


At the time this question was asked, pandas 0.23.0 had just been released. That version changed the default behaviour of .to_sql() from calling the DBAPI .executemany() method to constructing a table-value constructor (TVC) that would improve upload speed by inserting multiple rows with a single .execute() call of an INSERT statement. Unfortunately that approach often exceeded T-SQL's limit of 2100 parameter values for a stored procedure, leading to the error cited in the question.

Shortly thereafter, a subsequent release of pandas added a method= argument to .to_sql(). The default – method=None – restored the previous behaviour of using .executemany(), while specifying method="multi" would tell .to_sql() to use the newer TVC approach.

Around the same time, SQLAlchemy 1.3 was released and it added a fast_executemany=True argument to create_engine() which greatly improved upload speed using Microsoft's ODBC drivers for SQL Server. With that enhancement, method=None proved to be at least as fast as method="multi" while avoiding the 2100-parameter limit.

So with current versions of pandas, SQLAlchemy, and pyodbc, the best approach for using .to_sql() with Microsoft's ODBC drivers for SQL Server is to use fast_executemany=True and the default behaviour of .to_sql(), i.e.,

from sqlalchemy import create_engine

# df is an existing DataFrame
connection_uri = (
    "mssql+pyodbc://scott:tiger^5HHH@192.168.0.199/db_name"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)
engine = create_engine(connection_uri, fast_executemany=True)
df.to_sql("table_name", engine, index=False, if_exists="append")

This is the recommended approach for apps running on Windows, macOS, and the Linux variants that Microsoft supports for its ODBC driver. If you need to use FreeTDS ODBC, then .to_sql() can be called with method="multi" and chunksize= as described below.
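For example, a minimal sketch of the FreeTDS case, assuming the driver is registered as "FreeTDS" in odbcinst.ini (the host, port, credentials, and TDS version below are placeholders, not values from the question):

from sqlalchemy import create_engine

connection_uri = (
    "mssql+pyodbc://scott:tiger^5HHH@192.168.0.199:1433/db_name"
    "?driver=FreeTDS&TDS_Version=7.3"
)
# no fast_executemany here; with FreeTDS use method="multi" as described in this answer
engine = create_engine(connection_uri)

# keep each chunk under the 2100-parameter limit (calculation explained below)
tsql_chunksize = min(2097 // len(df.columns), 1000)
df.to_sql("table_name", engine, index=False, if_exists="append",
          method="multi", chunksize=tsql_chunksize)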


(Original answer)

Prior to pandas version 0.23.0, to_sql would generate a separate INSERT for each row in the DataFrame:

exec sp_prepexec @p1 output,N'@P1 int,@P2 nvarchar(6)',
    N'INSERT INTO df_to_sql_test (id, txt) VALUES (@P1, @P2)',
    0,N'row000'
exec sp_prepexec @p1 output,N'@P1 int,@P2 nvarchar(6)',
    N'INSERT INTO df_to_sql_test (id, txt) VALUES (@P1, @P2)',
    1,N'row001'
exec sp_prepexec @p1 output,N'@P1 int,@P2 nvarchar(6)',
    N'INSERT INTO df_to_sql_test (id, txt) VALUES (@P1, @P2)',
    2,N'row002'

Presumably to improve performance, pandas 0.23.0 now generates a table-value constructor to insert multiple rows per call:

exec sp_prepexec @p1 output,N'@P1 int,@P2 nvarchar(6),@P3 int,@P4 nvarchar(6),@P5 int,@P6 nvarchar(6)',
    N'INSERT INTO df_to_sql_test (id, txt) VALUES (@P1, @P2), (@P3, @P4), (@P5, @P6)',
    0,N'row000',1,N'row001',2,N'row002'

The problem is that SQL Server stored procedures (including system stored procedures like sp_prepexec) are limited to 2100 parameters, so if the DataFrame has 100 columns then to_sql can only insert about 20 rows at a time.

We can calculate the required chunksize using

# df is an existing DataFrame
#
# limit based on sp_prepexec parameter count
tsql_chunksize = 2097 // len(df.columns)
# cap at 1000 (limit for number of rows inserted by table-value constructor)
tsql_chunksize = 1000 if tsql_chunksize > 1000 else tsql_chunksize
#
df.to_sql('tablename', engine, index=False, if_exists='replace',
          method='multi', chunksize=tsql_chunksize)

However, the fastest approach is still likely to be:

  • dump the DataFrame to a CSV file (or similar), and then

  • have Python call the SQL Server bcp utility to upload that file into the table (a rough sketch follows).
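A minimal sketch of that two-step approach, assuming bcp is installed and on the PATH; the file name is hypothetical, and the server, database, and credentials are placeholders borrowed from the connection example above:

import subprocess

# dump the DataFrame to a tab-delimited file (no header; bcp loads raw data rows)
df.to_csv("producing_entity_details.tsv", sep="\t", index=False, header=False)

# have bcp bulk-load the file; -c = character data, -t = field terminator
subprocess.run(
    [
        "bcp", "db_name.dbo.TEMP_producing_entity_details",
        "in", "producing_entity_details.tsv",
        "-S", "192.168.0.199",
        "-U", "scott", "-P", "tiger^5HHH",
        "-c", "-t", "\t",
    ],
    check=True,
)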

Gord Thompson
  • Seems like an odd update? I imagine a significant percentage of users will want to insert dataframes containing > 2100 params... SQL Server is built for large datasets... – user3725021 Jun 11 '18 at 08:15
  • @user3725021 - I agree that the pandas update should have taken the T-SQL limits into account, at least for the SQL Server dialect. We can work around those with the `chunksize` argument of `to_sql`, but it would be better if pandas took care of that itself. – Gord Thompson Jun 11 '18 at 17:07
  • I think they changed it one more time and added a `method` argument that defaults to `None`, or in other words the old row-by-row/`executemany` behaviour. If passed `"multi"`, it'll use the multiple-values insert. It also enables more complex handling if passed a callable, which allows using, for example, PostgreSQL's `COPY`. – Ilja Everilä Jul 13 '19 at 14:11
  • I'm finding the .to_sql performance better with `method='multi', chunksize=tsql_chunksize` as recommended here. Thanks for this! – Nikhil VJ Jun 05 '20 at 10:42
  • Does this 2100 limit also apply to DELETE statements? – Nikhil VJ Jun 05 '20 at 11:27
  • @Nikhil - I would presume so, yes. – Gord Thompson Jun 05 '20 at 11:48

For me, the solution was NOT to use:

engine = create_engine(connection_uri, fast_executemany=True)

Instead, I just played with:

df.to_sql('tablename', engine, index=False, if_exists='replace',
          method='multi', chunksize=100)

Here, instead of chunksize=100, I put chunksize=90 and it started to work. Obviously that is because my previous table was smaller; for a larger number of columns you might need a smaller number here. Play around with it if you don't want to do the calculations, which might be wrong for various reasons.

Hrvoje

I made a few modifications based on Gord Thompson's answer. This will auto-calculate the chunksize and keep it at the nearest lower integer value that fits within the 2100-parameter limit:

import math

df_num_of_cols = len(df.columns)
chunknum = math.floor(2100 / df_num_of_cols)
df.to_sql('MY_TABLE', con=engine, schema='myschema', chunksize=chunknum,
          if_exists='append', method='multi', index=False)
Amit S

I don't have enough reputation, so I cannot comment on Amit S's answer. I just tried that approach, with chunknum calculated and the method set to 'multi'. It still shows me the error:

[Microsoft][SQL Server Native Client 11.0][SQL Server]The incoming request has too many parameters. The server supports a maximum of 2100 parameters. Reduce the number of parameters and resend the request

So I just modified:

chunknum=math.floor(2100/df_num_of_cols) 

to

chunknum=math.floor(2100/df_num_of_cols) - 1

It seems to be working perfectly now. I think it must be some edge-case problem...
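A plausible explanation, and this is an assumption on my part rather than something confirmed above, is that sp_prepexec itself consumes a few of the 2100 parameter slots, which is why the accepted answer divides by 2097 rather than 2100. Using the smaller divisor makes the - 1 adjustment unnecessary:

import math

# leave headroom for the parameters sp_prepexec uses internally (assumption,
# mirroring the 2097 used in the accepted answer) and cap at the 1000-row
# limit of a table-value constructor
chunknum = min(math.floor(2097 / df_num_of_cols), 1000)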

Afsanefda