
Problem description

In an ETL pipeline, we update a table in an SQL database from a pandas dataframe. The table has about 2 million rows, and the dataframe updates approximately 1 million of them. We do it with SQLAlchemy in Python, and the database is SQL Server (I think this is not too relevant to the question, but I mention it for completeness).

At the moment the code is used "as found", and the update consists of the following steps (a rough sketch follows the list):

  1. Split the dataframe into many sub-dataframes (the number appears to be fixed and arbitrary; it does not depend on the dataframe size).
  2. For each sub-dataframe, run an UPDATE query.
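
For reference, a minimal sketch of what the current code roughly does (the connection string, table and column names, and chunk count are placeholders, not the real ones):

    # Sketch of the current chunked-update pattern; all names are placeholders.
    import numpy as np
    import pandas as pd
    import sqlalchemy as sa

    engine = sa.create_engine("mssql+pyodbc://user:password@my_dsn")

    N_CHUNKS = 100  # the fixed, arbitrary number of sub-dataframes

    def update_in_chunks(df: pd.DataFrame) -> None:
        stmt = sa.text(
            "UPDATE my_table "
            "SET fk1 = :fk1, fk2 = :fk2, value = :value "
            "WHERE id = :id"
        )
        for chunk in np.array_split(df, N_CHUNKS):
            with engine.begin() as conn:  # one transaction per chunk
                conn.execute(stmt, chunk.to_dict(orient="records"))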

As it is, the process takes what (in my admittedly very limited SQL experience) seems like far too long: about 1-2 hours. The table schema consists of 4 columns (a sketch follows the list):

  • An id as the primary key
  • 2 columns that are foreign keys (primary keys in their respective tables)
  • A 4th column
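
Roughly, in SQLAlchemy terms (all table and column names here are made up):

    # Approximate shape of the table; names are placeholders.
    import sqlalchemy as sa

    metadata = sa.MetaData()

    my_table = sa.Table(
        "my_table",
        metadata,
        sa.Column("id", sa.Integer, primary_key=True),
        sa.Column("fk1", sa.Integer, sa.ForeignKey("parent_a.id")),
        sa.Column("fk2", sa.Integer, sa.ForeignKey("parent_b.id")),
        sa.Column("value", sa.String(100)),  # the 4th column
    )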

Questions

  • What can I do to make the code more efficient and faster? Since the UPDATE is done in blocks, I'm unsure whether the index is recalculated every time (since the id value is not changed, I don't see why that would be the case). I also don't know how the foreign-key values (which can change for a given row) factor into the cost.

  • At what point, if any, does it make sense to insert all the rows into a new auxiliary table, recalculate the index only at the end, truncate the original table, and copy the auxiliary table into it (roughly as sketched below)? Are there any subtleties with this approach regarding the indices, foreign keys, etc.?
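
To make the second question more concrete, this is roughly the pattern I have in mind (untested, and all table and column names are placeholders):

    # Sketch of the "load into a staging table, then swap" idea; untested,
    # all names are placeholders.
    import pandas as pd
    import sqlalchemy as sa

    engine = sa.create_engine("mssql+pyodbc://user:password@my_dsn")

    def rebuild_via_staging(df: pd.DataFrame) -> None:
        with engine.begin() as conn:
            # 1. Bulk-insert all rows (unchanged and updated) into an auxiliary table.
            df.to_sql("my_table_staging", conn, if_exists="replace", index=False)

            # 2. Empty the original table and refill it in a single statement,
            #    so the index and foreign keys are maintained once at the end
            #    instead of once per chunk.
            conn.execute(sa.text("TRUNCATE TABLE my_table"))
            conn.execute(sa.text(
                "INSERT INTO my_table (id, fk1, fk2, value) "
                "SELECT id, fk1, fk2, value FROM my_table_staging"
            ))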
