
I am getting data from an API using Python and transforming it into a pandas DataFrame. The DataFrame has a Date column plus 50 more columns. I have retrieved data for the whole past week.

Using SQLAlchemy, I created a table in Azure SQL Server and loaded the whole past week's data with df.to_sql():

df.to_sql('table', engine, index=False, dtype=types)
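For reference, a minimal runnable sketch of that initial load, using an in-memory SQLite engine as a stand-in for the Azure SQL Server engine (the three columns and the `types` mapping here are invented for illustration; the real table has Date plus 50 other columns):

```python
import datetime

import pandas as pd
import sqlalchemy

# Stand-in for the Azure SQL Server engine from the question
engine = sqlalchemy.create_engine("sqlite://")

df = pd.DataFrame({
    "Date": [datetime.date(2023, 5, 3)],
    "name": ["alice"],
    "age": [30],
})

# Explicit SQL types so to_sql() does not guess the column types
types = {
    "Date": sqlalchemy.types.Date(),
    "name": sqlalchemy.types.String(50),
    "age": sqlalchemy.types.Integer(),
}

df.to_sql("table", engine, index=False, dtype=types)
```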

Any past date from this API can be changed.

For example, today I stored 60k rows with Date = 05/03/2023, but if I query the API tomorrow for that same date (05/03/2023), there may be 62k rows (2k new rows) and even rows that changed. So I want to do an upsert based on a selected date (such as 05/03/2023).

I have tried following this: How to upsert pandas DataFrame to Microsoft SQL Server table?

So, I created a #temp_table to hold the new DataFrame with the updated data for 05/03/2023 and tried to run the query below, which is based on the question I mentioned.

OBS: the columns in the example below are just an illustration. In real life, I have 51 columns (Date + 50).

OBS 2: I don't have a primary key, so in the ON clause I am comparing the dates.

# Just to be clear, this is how I am defining my engine variable with SQLAlchemy
import sqlalchemy
from sqlalchemy import text

engine = sqlalchemy.create_engine(
    'mssql+pyodbc:///?odbc_connect=%s' % url_db,
    fast_executemany=True,
    use_setinputsizes=False,
)

# Creating #temp_table with the new df
new_df.to_sql('#temp_table', engine, if_exists='replace', index=False, dtype=types)

# Building the upsert query (note the trailing space before each clause --
# without it the concatenated string runs "...[temp].[age]WHEN NOT MATCHED..."
# together)
query = (
    "MERGE table WITH (HOLDLOCK) AS main "
    "USING (SELECT date, name, age FROM #temp_table) AS temp "
    "ON ([main].[date] = [temp].[date]) "
    "WHEN MATCHED THEN "
    "UPDATE SET [main].[date] = [temp].[date], [main].[name] = [temp].[name], [main].[age] = [temp].[age] "
    "WHEN NOT MATCHED THEN "
    "INSERT (date, name, age) VALUES ([temp].[date], [temp].[name], [temp].[age]);"
)

# Running the query; engine.begin() commits on success and rolls back on error,
# so no explicit commit()/rollback() is needed
with engine.begin() as conn:
    conn.execute(text(query))  # GETTING STUCK HERE!!!

I added some print() calls and found that everything runs properly up to conn.execute(text(query)), where it gets stuck. The code had been running for 30 minutes and didn't finish.

Is there anything wrong with my code? Or is it caused by the large amount of data? How can I optimize it?
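One pattern worth considering, since every past row for a given date can change anyway, is to replace the whole date slice instead of MERGE: delete the existing rows for the selected date and bulk-insert the fresh ones inside a single transaction. A minimal sketch, using an in-memory SQLite engine as a stand-in for Azure SQL Server and invented columns:

```python
import pandas as pd
import sqlalchemy
from sqlalchemy import text

engine = sqlalchemy.create_engine("sqlite://")  # stand-in for the Azure SQL engine

# Existing data: two dates, one of which will be refreshed
old = pd.DataFrame({"date": ["2023-05-02", "2023-05-03"], "age": [30, 40]})
old.to_sql("main_table", engine, index=False)

# Refreshed data for 2023-05-03: one changed row plus one new row
new_df = pd.DataFrame({"date": ["2023-05-03", "2023-05-03"], "age": [41, 42]})

with engine.begin() as conn:
    # Remove every existing row for the selected date, then append the fresh
    # rows; both statements share one transaction, so a failure rolls back both
    conn.execute(text("DELETE FROM main_table WHERE date = :d"), {"d": "2023-05-03"})
    new_df.to_sql("main_table", conn, if_exists="append", index=False)

result = pd.read_sql("SELECT date, age FROM main_table ORDER BY date, age", engine)
print(len(result))  # 3 rows: 1 for the untouched date + 2 refreshed rows
```

This sidesteps the need for a unique key entirely, at the cost of rewriting the whole slice for that date.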

  • Is the main.date column indexed? – Gord Thompson May 08 '23 at 22:48
  • Hi! No, it's not indexed. I am pretty new to SQL. Do you think that indexes could help? – carolina131 May 08 '23 at 22:59
  • Also: (1) Think carefully about whether matching on just `[main].[date] = [temp].[date]` is sufficient. If you have 60k rows for date 05/03/2023 then that column alone is nowhere near a unique identifier for a row. (2) If you are matching on `[main].[date] = [temp].[date]` then `WHEN MATCHED THEN UPDATE SET [main].[date] = [temp].[date]` is unnecessary because they already match. – Gord Thompson May 08 '23 at 23:02
  • "Do you think that indexes could help?" - Definitely. Without an index on that column the database will have to do a *table scan* to find the matching row(s). – Gord Thompson May 08 '23 at 23:03
  • Now I get it, it makes sense. I checked the columns and none are 100% unique, so I thought about creating indexes myself, adding the index according to the row position. My question is: how would I use it for the upsert, mainly for the update? I will have 120k rows for two days with index 1-120k, then a temp table with 62k rows and index 0-62k for a single day. How could they match? – carolina131 May 08 '23 at 23:15
  • Please share the query plan via https://brentozar.com/pastetheplan, please also show relevant tables and index definitions. Most likely you need a clustered index/primary key on `date` – Charlieface May 08 '23 at 23:26
  • "I checked the columns and none are 100% unique" - You don't need any one column to be unique, you need a *combination of columns* that uniquely identify a row. Adding an IDENTITY column to the main table won't really help unless your data source for the temp table has some way of knowing what those identity values are for existing rows in the main table. (That's sort of a "chicken and egg" problem.) – Gord Thompson May 08 '23 at 23:31
  • @Charlieface I copied the plan XML and tried to paste it into the link you provided, but I received the following error: "The supplied XML did not parse correctly. Are you sure you have a valid query plan XML text?" Is that because my plan is monstrous, or is there something wrong with it? – carolina131 May 09 '23 at 01:22
  • Thank you for the explanation, @GordThompson! Is there any link, material or reference I can use? I understand now that I need to find "a combination of columns that uniquely identify a row", but after that and creating the clustered index, I feel lost as to how I should organize my upsert query. – carolina131 May 09 '23 at 01:24
  • No, possibly because PasteThePlan doesn't like new versions of SQL Server. If you can upload it somewhere like PasteBin I can sort it out. Just needs a slight change to the first line of the XML – Charlieface May 09 '23 at 01:43
  • Are dates unique on both sides? Otherwise, wouldn't you be updating "too many" rows if you get multiple hits on both sides? MERGE usually just crashes when that happens. – siggemannen May 09 '23 at 07:50
  • @siggemannen, the dates are not unique. I tried creating a clustered index with date and other columns, but there is no combination of columns that results in a unique row. Is it still possible to upsert? – carolina131 May 09 '23 at 13:03
  • Yes, but the data will be crap. Consider you have two rows in the source: date 20230101 measure 1, and date 20230101 measure 2. You try to upsert another table with two rows: date 20230101 measure 2, and date 20230101 measure 3. If you only join by dates, it will be random which measure is assigned to each target row. – siggemannen May 09 '23 at 13:14
  • A `Merge` must be joining on a unique join condition, so you need a way of making a unique key out of any number of columns in the table. – Charlieface May 09 '23 at 13:28
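The non-unique-join problem the last few comments describe can be reproduced with a plain pandas merge: with `date` as the only join key, every row for a date on one side pairs with every row for that date on the other side, mirroring siggemannen's two-rows-per-side example.

```python
import pandas as pd

# Two rows per side for the same date, as in siggemannen's example
main = pd.DataFrame({"date": ["2023-01-01", "2023-01-01"], "measure": [1, 2]})
temp = pd.DataFrame({"date": ["2023-01-01", "2023-01-01"], "measure": [2, 3]})

# Joining only on the non-unique date column: 2 main rows x 2 temp rows = 4
# matched pairs instead of a one-to-one match -- exactly the ambiguity that
# makes a MERGE on this ON clause unable to decide which row updates which
joined = main.merge(temp, on="date", suffixes=("_main", "_temp"))
print(len(joined))  # 4
```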

0 Answers