
In response to my question "How to speed up data wrangling A LOT in Python + Pandas + sqlAlchemy + MSSQL/T-SQL" I was kindly directed to "Speeding up pandas.DataFrame.to_sql with fast_executemany of pyODBC" by @IljaEverilä.

NB For test purposes I am only reading/writing 10k rows.

I added the event listener and a) the function is called, but b) executemany is clearly not set: the if test fails and cursor.fast_executemany is never set.

import sqlalchemy as sqla
from sqlalchemy import event

def namedDbSqlAEngineCreate(dbName):
    # Create an engine and switch to the named db.
    # Returns the engine if successful and None if not.
    # 2018-08-23 added fast_executemany according to this https://stackoverflow.com/questions/48006551/speeding-up-pandas-dataframe-to-sql-with-fast-executemany-of-pyodbc?rq=1
    engineStr = 'mssql+pyodbc://@' + defaultDSN  # defaultDSN is defined elsewhere
    engine = sqla.create_engine(engineStr, echo=False)

    @event.listens_for(engine, 'before_cursor_execute')
    def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
        # print("FUNC call")
        if executemany:
            print('executemany')
            cursor.fast_executemany = True

    try:
        engine.execute('USE ' + dbName)
        return engine
    except sqla.exc.SQLAlchemyError as ex:
        if ex.orig.args[0] == '08004':
            print('namedDbSqlAEngineCreate: Database %s does not exist' % dbName)
        else:
            print(ex.args[0])
        return None

Naturally there is no change in speed.

The to_sql call from my original question is unchanged

nasToFillDF.to_sql(name=tempTableName, con=engine.engine, if_exists='replace', chunksize=100, index=False)

because I tried setting chunksize=None, per the example, and received the error message (which I had encountered previously)

(pyodbc.ProgrammingError) ('The SQL contains -31072 parameter markers, but 100000 parameters were supplied', 'HY000')

What have I done wrong? I presume the executemany parameter of receive_before_cursor_execute is never True, but if that is the answer I have no idea how to fix it.

Setup is pyodbc 4.0.23, SQLAlchemy 1.2.6, Python 3.6.something

  • @IljaEverilä statement as it is in the engine creation or for the to_sql? – Julian Moore Aug 23 '18 at 09:26
  • @IljaEverilä yes, I know (I've been running it and looking at the output) but it executes many times; there's lots of output and I wondered which particular statement value was of interest since I create the engine, read in a dataframe, then do the to_sql output – Julian Moore Aug 23 '18 at 09:31
  • @IljaEverilä For the insert itself the statement is "INSERT INTO [tempNasFilledTable] ([Rownumber], [Dateyyyymmdd], [Time], [msSinceEpoch], [Bid], [Ask], [Spread], [Contributor], [Region], [City]) VALUES " followed by "(?, ?, ?, ?, ?, ?, ?, ?, ?, ?)" n-times and then the pyodbc error occurs "(pyodbc.Error) ('07002', '[07002] [Microsoft][ODBC Driver 17 for SQL Server]COUNT field incorrect or syntax error (0) (SQLExecDirectW)')" – Julian Moore Aug 23 '18 at 09:34
  • @IljaEverilä what should be concluded from this/done about it then? – Julian Moore Aug 23 '18 at 09:36
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/178602/discussion-between-ilja-everila-and-julian-moore). – Ilja Everilä Aug 23 '18 at 09:58

1 Answer


The error you received is caused by changes introduced in Pandas version 0.23.0, reverted in 0.23.1, and reintroduced in 0.24.0, as explained here. The produced VALUES clause contains 100,000 parameter markers, and it would seem that the count is stored in a signed 16-bit integer, so it overflows and you get the funny

The SQL contains -31072 parameter markers, but 100000 parameters were supplied

You can check for yourself:

In [16]: 100000 % (2 ** 16) - 2 ** 16
Out[16]: -31072

If you would like to keep on using Pandas as is, you will have to calculate and provide a suitable chunksize value, such as the 100 you were using, taking into account both the maximum row limit of 1,000 for the VALUES clause and the maximum parameter limit of 2,100 for stored procedures. The details are again explained in the linked Q/A.
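As a rough illustration (the helper name below is mine, not part of pandas), the largest safe chunk size can be derived from the column count; the -1 just leaves a small safety margin:

def safe_chunksize(frame, index=True):
    # Largest chunksize that respects both SQL Server limits:
    # 1,000 rows per VALUES clause and 2,100 parameters per statement.
    # The index column counts as a parameter when index=True.
    params_per_row = len(frame.columns) + (1 if index else 0)
    return min(1000, 2100 // params_per_row - 1)

For a frame with 9 data columns plus the index this gives 209, which is the chunksize used in the vanilla benchmark below.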

Before the change Pandas always used executemany() when inserting data. Newer versions detect whether the dialect in use supports the VALUES clause in INSERT. This detection happens in SQLTable.insert_statement() and cannot be controlled, which is a shame since PyODBC has fixed their executemany() performance, provided the right flag is enabled.

In order to force Pandas to use executemany() with PyODBC again, SQLTable has to be monkeypatched:

import pandas.io.sql

def insert_statement(self, data, conn):
    # Pre-0.23 behaviour: return a plain INSERT plus the row data,
    # so SQLAlchemy executes it with executemany().
    return self.table.insert(), data

pandas.io.sql.SQLTable.insert_statement = insert_statement

This will be horribly slow if the Cursor.fast_executemany flag is not set, so remember to set the proper event handler.
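For completeness, here is a minimal sketch of such a handler, following the same pattern as the listener in the question (it assumes engine is the pyodbc-backed engine used for the inserts):

from sqlalchemy import event

@event.listens_for(engine, 'before_cursor_execute')
def receive_before_cursor_execute(conn, cursor, statement, params, context, executemany):
    if executemany:
        # Enable pyodbc's fast executemany mode for bulk inserts only
        cursor.fast_executemany = True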

Here is a simple performance comparison, using the following dataframe:

In [12]: df = pd.DataFrame({f'X{i}': range(1000000) for i in range(9)})

Vanilla Pandas 0.24.0:

In [14]: %time df.to_sql('foo', engine, chunksize=209)
CPU times: user 2min 9s, sys: 2.16 s, total: 2min 11s
Wall time: 2min 26s

Monkeypatched Pandas with fast executemany enabled:

In [10]: %time df.to_sql('foo', engine, chunksize=500000)
CPU times: user 12.2 s, sys: 981 ms, total: 13.2 s
Wall time: 38 s
  • Helpful. How would I adapt this to use the multi option with df.to_sql to upload a csv? + I tried running your top code block first before an attempt to use df.to_sql to upload a csv but still got the same error. – QHarr Apr 01 '20 at 07:40
  • Not sure if I was supposed to update this line _pandas.io.sql.SQLTable.insert_statement = insert_statement_ – QHarr Apr 01 '20 at 07:46
  • Not anymore with latest pandas and SQLAlchemy. With pyodbc pass `fast_executemany=True` to `create_engine()` and leave `multi=` at its default `None`. – Ilja Everilä Apr 01 '20 at 11:25
  • Thanks for that. – QHarr Apr 01 '20 at 13:47