I currently have an SQL table that contains sensitive information. I have written a Python automation script that processes Excel files and writes the resulting dataframe to the table. What I want is to check whether any of the rows about to be written already exist in the table, and if they do, not append them. Is there an efficient way of doing this as the SQL table grows increasingly large?
My idea was to pull all of the current data out of the SQL table, append the dataframe, and then delete the duplicates. I'm sure there is a better way, as I don't want to have to copy the whole table and replace it each time I want to append values to it.
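In other words, something along these lines (untested sketch, just to illustrate the idea, using the same df_sql / engine / table_name as in the snippet below; drop_duplicates is a stand-in for whatever actually defines a duplicate row):

    # Read the whole table, stack the new rows on top, dedupe, and rewrite everything
    existing = pd.read_sql(f'SELECT * FROM {table_name}', engine)
    combined = pd.concat([existing, df_sql], ignore_index=True).drop_duplicates()
    combined.to_sql(table_name, con=engine, schema='dbo', if_exists='replace', index=False)

What I currently have is this: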
    try:
        # Append the new rows; this fails if the dataframe has columns the table doesn't
        df_sql.to_sql(table_name, con=engine, schema='dbo', if_exists='append', index=False)
    except Exception:
        # Fall back to rebuilding the table so any new columns are picked up
        data = pd.read_sql(f'SELECT * FROM {table_name}', engine)
        new_data = pd.concat([data, df_sql], ignore_index=True)
        new_data.to_sql(table_name, con=engine, schema='dbo', if_exists='replace', index=False)
This lets me deal with any new columns that have been added. I wanted to see if there is a similar way to handle rows in the dataframe that already exist in the table?
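Roughly what I have in mind is something like this (untested sketch; key_columns is a hypothetical placeholder for whichever columns actually identify a unique row in my table):

    # Hypothetical key columns -- replace with whatever uniquely identifies a row
    key_columns = ['id', 'date']

    # Pull only the key columns from the existing table rather than everything
    existing_keys = pd.read_sql(f'SELECT {", ".join(key_columns)} FROM {table_name}', engine).drop_duplicates()

    # Keep only rows whose keys are not already in the table (left anti-join)
    merged = df_sql.merge(existing_keys, on=key_columns, how='left', indicator=True)
    new_rows = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')

    new_rows.to_sql(table_name, con=engine, schema='dbo', if_exists='append', index=False)

but I'm not sure whether reading the key columns back each time actually scales any better once the table gets large, which is why I'm asking.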
Cheers