
I currently have a SQL table that contains sensitive information. I have written a Python automation script that processes Excel files and writes the resulting dataframe to that table. Before writing, I want to check whether any of the rows in the dataframe already exist in the table; if they do, they should not be appended. Is there an efficient way of doing this that still works as the SQL table grows large?

My idea was to read all of the current information from the SQL table, append the dataframe, and then drop the duplicates. I'm sure there is a better way to do this, as I don't want to copy the whole table and replace it every time I append values to it.

try:
    # Fast path: append the new rows directly
    df_sql.to_sql(table_name, con=engine, schema='dbo', if_exists='append', index=False)
except Exception:
    # Fallback (e.g. the dataframe has columns the table lacks):
    # read the whole table, combine, and rebuild it
    data = pd.read_sql(f'SELECT * FROM {table_name}', engine)
    new_data = pd.concat([data, df_sql])
    new_data.to_sql(table_name, con=engine, schema='dbo', if_exists='replace', index=False)

This is what I currently have. It lets me handle any new columns that have been added. Is there a way I could do the same, but also deduplicate rows against what is already in the table?
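One way to avoid re-reading the whole table would be to pull back only the key column(s) and drop the rows that already exist before appending. This is just a sketch; the table, column names (`my_table`, `id`, `value`), and the in-memory SQLite engine are illustrative, not from the original post, and it assumes some set of columns uniquely identifies a row:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # in-memory DB just for illustration

# Existing table contents
existing = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
existing.to_sql("my_table", engine, index=False)

# Incoming dataframe, partially overlapping with the table
df_sql = pd.DataFrame({"id": [2, 3], "value": ["b", "c"]})

# Read only the key column(s) instead of the whole table
keys = pd.read_sql("SELECT id FROM my_table", engine)

# Anti-join: keep only rows whose keys are not already present
new_rows = df_sql[~df_sql["id"].isin(keys["id"])]

new_rows.to_sql("my_table", engine, if_exists="append", index=False)
```

Note this still transfers the key column over the wire on every run, so it only postpones the scaling problem rather than solving it.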

Cheers

Nhyi
  • Are you familiar with the term "upsert"? – Gord Thompson Nov 02 '21 at 12:56
  • I have not come across that term before. Should I look into it? – Nhyi Nov 02 '21 at 13:48
  • Yes, because it sounds like what you want to accomplish: **up**date a row if it exists, otherwise in**sert** it. (Sometimes also called "MERGE".) – Gord Thompson Nov 02 '21 at 13:51
  • However, in this case I don't want to update the row if it exists. Instead I just want to ignore it and insert the rest. – Nhyi Nov 02 '21 at 14:18
  • With regards to merging, is it a wise idea to select the whole table and put it into a pandas dataframe and then do an upsert on that? – Nhyi Nov 02 '21 at 14:18
  • A more common approach is to push the "potential inserts and/or updates" DataFrame to a temporary table on the database server and then do a MERGE (or similar) from the temporary table into the main table. Example [here](https://stackoverflow.com/a/62388768/2144390). – Gord Thompson Nov 02 '21 at 14:24
  • Ah that does sound like a really good approach that I didn't think about. At the end of the insert I'm guessing you drop all the rows from the temporary table or delete the temporary table and recreate it each time you write to the database? The issue I am having getting my head around is can I do the SQL query within Python or is this something I will need to write and do separately? – Nhyi Nov 02 '21 at 14:43
  • You can do the whole thing from Python. You can even get SQLAlchemy to build the required query/queries for you. There's an example [here](https://stackoverflow.com/a/68855140/2144390) that just does updates, but it could be adapted to just do inserts. – Gord Thompson Nov 02 '21 at 15:02
  • I will look into this more, I'm still quite new to a lot of this and that code looks quite daunting. Thanks again! – Nhyi Nov 02 '21 at 15:18
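The staging-table approach described in the comments could be sketched roughly as follows, entirely from Python. The names (`staging`, `my_table`, `id`, `value`) and the in-memory SQLite engine are illustrative; on SQL Server you would target the real table and could use `MERGE` instead of the generic `INSERT ... WHERE NOT EXISTS` shown here:

```python
import pandas as pd
from sqlalchemy import create_engine, text

engine = create_engine("sqlite://")  # illustrative; would be the real server

# Existing table contents
existing = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
existing.to_sql("my_table", engine, index=False)

# Incoming dataframe, partially overlapping with the table
df_sql = pd.DataFrame({"id": [2, 3], "value": ["b", "c"]})

# Replace the staging table each run so it holds only this batch
df_sql.to_sql("staging", engine, if_exists="replace", index=False)

# Insert only the staged rows whose key is not already in the main table;
# the comparison happens server-side, so nothing is pulled into pandas
with engine.begin() as conn:
    conn.execute(text("""
        INSERT INTO my_table (id, value)
        SELECT s.id, s.value
        FROM staging AS s
        WHERE NOT EXISTS (
            SELECT 1 FROM my_table AS t WHERE t.id = s.id
        )
    """))
```

Because the staging table is rewritten with `if_exists="replace"` on each run, there is no need to clean it up afterwards.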

0 Answers