
I have several large pandas dataframes (30k+ rows each) and need to upload a different version of them daily to a MS SQL Server database. I am trying to do so with the pandas to_sql function. Sometimes it works; other times it fails silently, behaving as if all of the data had been uploaded even though not a single row was inserted.

Here is my code:

class SQLServerHandler(DataBaseHandler):
    
    ...


    def _getSQLAlchemyEngine(self):
        '''
            Get an sqlalchemy engine
            from the connection string

            The fast_executemany fails silently:

            https://stackoverflow.com/questions/48307008/pandas-to-sql-doesnt-insert-any-data-in-my-table/55406717
        '''
        # escape special characters as required by sqlalchemy
        dbParams = urllib.parse.quote_plus(self.connectionString)
        # create engine
        engine = sqlalchemy.create_engine(
            'mssql+pyodbc:///?odbc_connect={}'.format(dbParams))

        return engine

    @logExecutionTime('Time taken to upload dataframe:')
    def uploadData(self, tableName, dataBaseSchema, dataFrame):
        '''
            Upload a pandas dataFrame
            to a database table <tableName>
        '''
        engine = self._getSQLAlchemyEngine()

        dataFrame.to_sql(
            tableName,
            con=engine,
            index=False,
            if_exists='append',
            method='multi',
            chunksize=50,
            schema=dataBaseSchema)

Switching the method to None seems to work properly, but the upload then takes an unreasonably long time (30+ minutes). With 20 or so tables of this size every day, that rules this approach out.

The solution proposed here, adding the schema as a parameter, doesn't work. Neither does creating a SQLAlchemy session and passing it to the con parameter via session.get_bind().

I am using:

  • ODBC Driver 17 for SQL Server
  • pandas 1.2.1
  • sqlalchemy 1.3.22
  • pyodbc 4.0.30

Does anyone know how to make it raise an exception if it fails?

Or why it is not uploading any data?
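
For now, the only safeguard I can come up with is to compare row counts before and after the upload and raise manually if they don't match; a sketch (the helper name, count query and error message are just illustrative):

import sqlalchemy

def uploadDataChecked(engine, tableName, dataBaseSchema, dataFrame):
    '''
        Sketch only: append dataFrame with to_sql and raise if the
        table did not grow by exactly len(dataFrame) rows
    '''
    countSql = sqlalchemy.text(
        'SELECT COUNT(*) FROM {}.{}'.format(dataBaseSchema, tableName))

    with engine.connect() as conn:
        before = conn.execute(countSql).scalar()

    dataFrame.to_sql(
        tableName,
        con=engine,
        index=False,
        if_exists='append',
        schema=dataBaseSchema)

    with engine.connect() as conn:
        after = conn.execute(countSql).scalar()

    if after - before != len(dataFrame):
        raise RuntimeError(
            'Expected {} new rows in {}, found {}'.format(
                len(dataFrame), tableName, after - before))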

MauroNeira
  • With MySQL I use the LOAD DATA INFILE statement to insert large amounts of data; it is 50x faster than individual INSERT statements. MS SQL Server has similar functionality: https://learn.microsoft.com/fr-fr/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-ver15 – Alex Jan 28 '21 at 16:09
  • Try using `create_engine(…, fast_executemany=True)` along with `.to_sql(…, method=None)`. – Gord Thompson Jan 28 '21 at 16:42
  • @GordThompson, unfortunately, this fails silently too – MauroNeira Jan 28 '21 at 17:48
  • Do you have a test environment where you can work on this without messing up production? If so, then you could use `.create_engine(…, echo=True)` to see the SQL statements being sent to the server and/or use [ODBC tracing](https://github.com/mkleehammer/pyodbc/wiki/Troubleshooting-%E2%80%93-Generating-an-ODBC-trace-log) to see if anything looks amiss at that level of the conversation. – Gord Thompson Jan 28 '21 at 21:06
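
For reference, the echo option mentioned in the last comment is just a keyword argument on create_engine; a minimal sketch (the connection string here is a placeholder, not the real one):

import urllib.parse
import sqlalchemy

# placeholder connection string; substitute the real one
connection_string = (
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=myserver;DATABASE=mydb;UID=user;PWD=password'
)
db_params = urllib.parse.quote_plus(connection_string)

# echo=True makes SQLAlchemy log every statement it emits, so you can see
# whether the INSERT statements are actually being sent to the server
engine = sqlalchemy.create_engine(
    'mssql+pyodbc:///?odbc_connect={}'.format(db_params),
    echo=True)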

2 Answers


In rebuttal to this answer: if to_sql() were to fall victim to the issue described in

SQL Server does not finish execution of a large batch of SQL statements

then it would have to be constructing large anonymous code blocks of the form

-- Note no SET NOCOUNT ON;
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (0, 'row0');
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (1, 'row1');
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (2, 'row2');
…

and that is not what to_sql() is doing. If it were, then it would start to fail well below 1_000 rows, at least on SQL Server 2017 Express Edition:

import pandas as pd
import pyodbc
import sqlalchemy as sa

print(pyodbc.version)  # 4.0.30

table_name = "gh_pyodbc_262"
num_rows = 400
print(f" num_rows: {num_rows}")  # 400

cnxn = pyodbc.connect("DSN=mssqlLocal64", autocommit=True)
crsr = cnxn.cursor()

crsr.execute(f"TRUNCATE TABLE {table_name}")

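# build one large anonymous batch of individual INSERT statements (note: no SET NOCOUNT ON)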
sql = "".join(
    [
        f"INSERT INTO {table_name} ([id], [txt]) VALUES ({i}, 'row{i}');"
        for i in range(num_rows)
    ]
)
crsr.execute(sql)

row_count = crsr.execute(f"SELECT COUNT(*) FROM {table_name}").fetchval()
print(f"row_count: {row_count}")  # 316

Using to_sql() for that same operation works:

import pandas as pd
import pyodbc
import sqlalchemy as sa

print(pyodbc.version)  # 4.0.30

table_name = "gh_pyodbc_262"
num_rows = 400
print(f" num_rows: {num_rows}")  # 400

df = pd.DataFrame(
    [(i, f"row{i}") for i in range(num_rows)], columns=["id", "txt"]
)

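# fast_executemany sends the rows to pyodbc as a single parameter array
# instead of one round trip per row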
engine = sa.create_engine(
    "mssql+pyodbc://@mssqlLocal64", fast_executemany=True
)

df.to_sql(
    table_name,
    engine,
    index=False,
    if_exists="replace",
)

with engine.connect() as conn:
    row_count = conn.execute(
        sa.text(f"SELECT COUNT(*) FROM {table_name}")
    ).scalar()
    print(f"row_count: {row_count}")  # 400

and indeed will work for thousands and even millions of rows. (I did a successful test with 5_000_000 rows.)
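
For very large frames it can also help to pass a chunksize so pandas hands the rows to the driver in batches rather than all at once; a sketch along those lines (same table and DSN as above, 1_000_000 rows):

import pandas as pd
import sqlalchemy as sa

table_name = "gh_pyodbc_262"
num_rows = 1_000_000

df = pd.DataFrame(
    [(i, f"row{i}") for i in range(num_rows)], columns=["id", "txt"]
)

engine = sa.create_engine(
    "mssql+pyodbc://@mssqlLocal64", fast_executemany=True
)

# chunksize caps how many rows pandas passes to executemany() per call;
# fast_executemany still batches those parameters efficiently in pyodbc
df.to_sql(
    table_name,
    engine,
    index=False,
    if_exists="replace",
    chunksize=100_000,
)

with engine.connect() as conn:
    row_count = conn.execute(
        sa.text(f"SELECT COUNT(*) FROM {table_name}")
    ).scalar()
    print(f"row_count: {row_count}")  # expect 1000000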

Gord Thompson

Ok, this seems to be an issue with SQL Server itself.

SQL Server does not finish execution of a large batch of SQL statements

MauroNeira