
I have several large pandas dataframes (30k+ rows each) and need to upload a different version of them daily to a MS SQL Server database. I am trying to do so with the pandas to_sql function. Sometimes it works; other times it fails silently, behaving as if all of the data had been uploaded even though not a single row was inserted.

Here is my code:

class SQLServerHandler(DataBaseHandler):
    
    ...


    def _getSQLAlchemyEngine(self):
        '''
            Get an sqlalchemy engine
            from the connection string

            The fast_executemany fails silently:

            https://stackoverflow.com/questions/48307008/pandas-to-sql-doesnt-insert-any-data-in-my-table/55406717
        '''
        # escape special characters as required by sqlalchemy
        dbParams = urllib.parse.quote_plus(self.connectionString)
        # create engine
        engine = sqlalchemy.create_engine(
            'mssql+pyodbc:///?odbc_connect={}'.format(dbParams))

        return engine

    @logExecutionTime('Time taken to upload dataframe:')
    def uploadData(self, tableName, dataBaseSchema, dataFrame):
        '''
            Upload a pandas dataFrame
            to a database table <tableName>
        '''
        engine = self._getSQLAlchemyEngine()

        dataFrame.to_sql(
            tableName,
            con=engine,
            index=False,
            if_exists='append',
            method='multi',
            chunksize=50,
            schema=dataBaseSchema)

Switching the method to None seems to work properly, but the upload then takes an unreasonably long time (30+ minutes). With 20 or so tables of this size every day, that rules this approach out.

The solution proposed here, adding the schema as a parameter, doesn't work. Neither does creating a SQLAlchemy session and passing it to the con parameter via session.get_bind().

I am using:

  • ODBC Driver 17 for SQL Server
  • pandas 1.2.1
  • sqlalchemy 1.3.22
  • pyodbc 4.0.30

Does anyone know how to make it raise an exception if it fails?

Or why it is not uploading any data?
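
For now, the only safeguard I can come up with is to compare row counts before and after the upload and raise manually if they don't match; a sketch (the helper name, count query and error message are just illustrative):

import sqlalchemy

def uploadDataChecked(engine, tableName, dataBaseSchema, dataFrame):
    '''
        Sketch only: append dataFrame with to_sql and raise if the
        table did not grow by exactly len(dataFrame) rows
    '''
    countSql = sqlalchemy.text(
        'SELECT COUNT(*) FROM {}.{}'.format(dataBaseSchema, tableName))

    with engine.connect() as conn:
        before = conn.execute(countSql).scalar()

    dataFrame.to_sql(
        tableName,
        con=engine,
        index=False,
        if_exists='append',
        schema=dataBaseSchema)

    with engine.connect() as conn:
        after = conn.execute(countSql).scalar()

    if after - before != len(dataFrame):
        raise RuntimeError(
            'Expected {} new rows in {}, found {}'.format(
                len(dataFrame), tableName, after - before))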

MauroNeira
  • With MySQL I use the LOAD DATA INFILE statement to insert large amounts of data; it is 50x faster than individual INSERT statements. MS SQL Server has similar functionality: https://learn.microsoft.com/fr-fr/sql/t-sql/statements/bulk-insert-transact-sql?view=sql-server-ver15 – Alex Jan 28 '21 at 16:09
  • Try using `create_engine(…, fast_executemany=True)` along with `.to_sql(…, method=None)`. – Gord Thompson Jan 28 '21 at 16:42
  • @GordThompson, unfortunately, this fails silently too – MauroNeira Jan 28 '21 at 17:48
  • Do you have a test environment where you can work on this without messing up production? If so, then you could use `.create_engine(…, echo=True)` to see the SQL statements being sent to the server and/or use [ODBC tracing](https://github.com/mkleehammer/pyodbc/wiki/Troubleshooting-%E2%80%93-Generating-an-ODBC-trace-log) to see if anything looks amiss at that level of the conversation. – Gord Thompson Jan 28 '21 at 21:06
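
For reference, the echo option mentioned in the last comment is just a keyword argument on create_engine; a minimal sketch (the connection string here is a placeholder, not the real one):

import urllib.parse
import sqlalchemy

# placeholder connection string; substitute the real one
connection_string = (
    'DRIVER={ODBC Driver 17 for SQL Server};'
    'SERVER=myserver;DATABASE=mydb;UID=user;PWD=password'
)
db_params = urllib.parse.quote_plus(connection_string)

# echo=True makes SQLAlchemy log every statement it emits, so you can see
# whether the INSERT statements are actually being sent to the server
engine = sqlalchemy.create_engine(
    'mssql+pyodbc:///?odbc_connect={}'.format(db_params),
    echo=True)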

2 Answers


In rebuttal to this answer: if to_sql() were to fall victim to the issue described in

SQL Server does not finish execution of a large batch of SQL statements

then it would have to be constructing large anonymous code blocks of the form

-- Note no SET NOCOUNT ON;
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (0, 'row0');
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (1, 'row1');
INSERT INTO gh_pyodbc_262 (id, txt) VALUES (2, 'row2');
…

and that is not what to_sql() is doing. If it were, then it would start to fail well below 1_000 rows, at least on SQL Server 2017 Express Edition:

import pandas as pd
import pyodbc
import sqlalchemy as sa

print(pyodbc.version)  # 4.0.30

table_name = "gh_pyodbc_262"
num_rows = 400
print(f" num_rows: {num_rows}")  # 400

cnxn = pyodbc.connect("DSN=mssqlLocal64", autocommit=True)
crsr = cnxn.cursor()

crsr.execute(f"TRUNCATE TABLE {table_name}")

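# build one large anonymous batch of individual INSERT statements (note: no SET NOCOUNT ON)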
sql = "".join(
    [
        f"INSERT INTO {table_name} ([id], [txt]) VALUES ({i}, 'row{i}');"
        for i in range(num_rows)
    ]
)
crsr.execute(sql)

row_count = crsr.execute(f"SELECT COUNT(*) FROM {table_name}").fetchval()
print(f"row_count: {row_count}")  # 316

Using to_sql() for that same operation works:

import pandas as pd
import pyodbc
import sqlalchemy as sa

print(pyodbc.version)  # 4.0.30

table_name = "gh_pyodbc_262"
num_rows = 400
print(f" num_rows: {num_rows}")  # 400

df = pd.DataFrame(
    [(i, f"row{i}") for i in range(num_rows)], columns=["id", "txt"]
)

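# fast_executemany sends the rows to pyodbc as a single parameter array
# instead of one round trip per row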
engine = sa.create_engine(
    "mssql+pyodbc://@mssqlLocal64", fast_executemany=True
)

df.to_sql(
    table_name,
    engine,
    index=False,
    if_exists="replace",
)

with engine.connect() as conn:
    row_count = conn.execute(
        sa.text(f"SELECT COUNT(*) FROM {table_name}")
    ).scalar()
    print(f"row_count: {row_count}")  # 400

and indeed will work for thousands and even millions of rows. (I did a successful test with 5_000_000 rows.)
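
For very large frames it can also help to pass a chunksize so pandas hands the rows to the driver in batches rather than all at once; a sketch along those lines (same table and DSN as above, 1_000_000 rows):

import pandas as pd
import sqlalchemy as sa

table_name = "gh_pyodbc_262"
num_rows = 1_000_000

df = pd.DataFrame(
    [(i, f"row{i}") for i in range(num_rows)], columns=["id", "txt"]
)

engine = sa.create_engine(
    "mssql+pyodbc://@mssqlLocal64", fast_executemany=True
)

# chunksize caps how many rows pandas passes to executemany() per call;
# fast_executemany still batches those parameters efficiently in pyodbc
df.to_sql(
    table_name,
    engine,
    index=False,
    if_exists="replace",
    chunksize=100_000,
)

with engine.connect() as conn:
    row_count = conn.execute(
        sa.text(f"SELECT COUNT(*) FROM {table_name}")
    ).scalar()
    print(f"row_count: {row_count}")  # expect 1000000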

Gord Thompson

Ok, this seems to be an issue with SQL Server itself.

SQL Server does not finish execution of a large batch of SQL statements

MauroNeira