
I am using the following modules:

  • SQLAlchemy==2.0.18
  • pyodbc==4.0.38
  • pandas==2.0.3

When I try to insert a dataframe with the to_sql() function, it takes 89 seconds for 44,777 rows and 28 columns, which seems too slow to me. With an earlier version of SQLAlchemy the same-sized dataframe used to insert in about 8 seconds.

How can I make it work faster?

I would like to get the insertion under 15 seconds for the 44,777 rows.

If I use fast_executemany=True, I get the following error:

(pyodbc.ProgrammingError) ('String data, right truncation: length 38 buffer 20', 'HY000')
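
For context, I was enabling it directly on the engine, roughly with the following change inside get_engine() in the code shown below (a sketch of what I tried, not my exact code):

params = urllib.parse.quote_plus(self.connection_string)
engine = sqlalchemy.create_engine(
    "mssql+pyodbc:///?odbc_connect=%s" % params,
    fast_executemany=True,  # enabling this flag is what triggers the error above
)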

My code:

connection.py

import sqlalchemy
import pyodbc
import urllib.parse

class DatabaseConnection:

    def __init__(self, connection_string: str) -> None:
        try:
            self.connection_string = str(connection_string)
        except Exception as err:
            raise err

    def get_engine(self) -> sqlalchemy.Engine:
        try:
            params = urllib.parse.quote_plus(self.connection_string)
            engine: sqlalchemy.Engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
            return engine
        except Exception as err:
            raise err

    def connect_to(self) -> sqlalchemy.Connection:
        try:
            engine = self.get_engine()
            conn = engine.connect()
            return conn
        except Exception as err:
            raise err

repository.py

import logging
import time

import pandas as pd
from sqlalchemy import text, Connection

from connection import DatabaseConnection


class StagingRepository:
    def __init__(self, connection_string: str) -> None:
        self.connection = DatabaseConnection(connection_string)
        self.log = logging.getLogger('project.StagingRepository')

    def write_data_to_stage(self, row: pd.Series, df: pd.DataFrame, dtypedict: dict):
        try:
            self.drop_stg_table(row)
            engine = self.connection.get_engine()
            self.log.info('Start insertion to staging database')
            chunk_size = 400
            start_time = time.time()
            for i in range(0, len(df), chunk_size):
                chunk = df[i:i+chunk_size]
                chunk.to_sql(str(row['DataTable']), engine, if_exists='append', index=False, dtype=dtypedict)
            end_time = time.time() 
            elapsed_time = end_time - start_time
            self.log.info(f'Dataframe insertion took {elapsed_time:.2f} seconds')
            result = df.shape[0]
            return result
        except Exception as err:    
            self.log.error(f'Insertion for File ID: {row["FileID"]}, Name:{row["FileName"]} into table: {row["DataTable"]} failed\n {err}', exc_info=True)
            raise err

    def drop_stg_table(self, row: pd.Series):
        conn: Connection = self.connection.connect_to()
        try:
            query = text(f"DROP TABLE IF EXISTS {row['DataTable']};")
            conn.execute(query)
            conn.commit()
        except Exception as err:
            raise err
        finally:
            conn.close()

  • hi Yonatan! Welcome to StackOverflow. Is your computer a potato? 44,000 rows doesn't seem like that many! – Mark Aug 16 '23 at 14:35
  • Have you tried https://stackoverflow.com/a/62671681/3275464 ? – Learning is a mess Aug 16 '23 at 14:36
  • @Learningisamess The problem is that I lose the functionality of creating the table with the datatypes, which I pass to to_sql(dtype=dtypedict) as a dictionary of column: datatype. Do you have a solution for that? – Yonatan Darmon Aug 16 '23 at 14:40
  • @Mark That is the point: when I separated the function from the program with SQLAlchemy 1.3, to_sql() inserted the same data in 8 seconds, but the newer version causes problems. Unfortunately I can't go back to the older version. – Yonatan Darmon Aug 16 '23 at 14:42
  • I don't see where you are specifying `fast_executemany=True`. If you are trying to add it to the ODBC connection string that's not where it goes. – Gord Thompson Aug 16 '23 at 14:49
  • @GordThompson I use it in create_engine(), but when I do I get the following error: (pyodbc.ProgrammingError) ('String data, right truncation: length 38 buffer 20', 'HY000'). That's why I removed it from the code. – Yonatan Darmon Aug 16 '23 at 15:06
  • Consider creating a [mcve] to illustrate your issue. For example, [this code](https://gist.github.com/gordthompson/6530ebb8fac7103ee91849c74141382f) will reproduce your error with `fast_executemany=True`, but it should be obvious why it is failing. – Gord Thompson Aug 16 '23 at 23:16
  • @GordThompson well, I went through the web and it seems fast_executemany=True is a known problem when the columns have many different types in pandas. I have 28 columns with mixed types: NVARCHAR(50-300), decimals, datetime, int, bigint, etc., which I try to insert into SQL Server. I don't understand why it's obvious why it is failing? It says I am inserting a string bigger than the limit, but I already checked and I didn't. (See the sketch below these comments.) – Yonatan Darmon Aug 17 '23 at 05:46
  • If you want to avoid problems with `fast_executemany=True` and you are working with SQL Server 2016+ then you could try using [this](https://gist.github.com/gordthompson/1fb0f1c3f5edbf6192e596de8350f205) – Gord Thompson Aug 17 '23 at 09:13
  • @GordThompson unfortunately fast_executemany=True isn't working for me, since I am inserting decimal data, and according to pyodbc there is a bug with large inserts that include decimal types. – Yonatan Darmon Aug 31 '23 at 07:13
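
To make the truncation discussion concrete, here is a minimal, hypothetical sketch (made-up table name, column name, and connection string, none of it my real schema) of the kind of mismatch that produces a "String data, right truncation" error with fast_executemany=True: a value longer than the width declared for its column in the dtype dictionary.

import urllib.parse

import pandas as pd
import sqlalchemy as sa

# Placeholder connection string; substitute real server/database details to run this.
conn_str = "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes;"
params = urllib.parse.quote_plus(conn_str)
engine = sa.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params, fast_executemany=True)

# A 38-character value bound against a column declared as NVARCHAR(20) in the dtype dict.
df = pd.DataFrame({"SomeColumn": ["x" * 38]})
dtypedict = {"SomeColumn": sa.types.NVARCHAR(20)}

# With fast_executemany=True, pyodbc sizes its parameter buffers from the target
# column definition, so the oversized value raises a pyodbc.ProgrammingError of the
# "String data, right truncation" kind rather than being inserted.
df.to_sql("stg_demo", engine, if_exists="replace", index=False, dtype=dtypedict)

This is only meant to illustrate the shape of the error, not to claim it is the exact cause in my case.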

0 Answers