PROBLEM
I have the following dataframe (I am working with confidential data, so I will provide a sample):
import pandas as pd
df = pd.read_csv('file.csv')
df
name weight height smoke_or_not
0 Ajitesh 84 183 no
1 Shail 79 186 yes
2 Seema 67 158 no
3 Nidhi 52 155 no
I am saving this dataframe as a sql table using the following code. It is not included here, but I have already made de connection to the SQL Server database.
df.to_sql('table name', engine, if_exists='replace', index=False)
This dataframe is coming from a file that is beging updated daily. This update may contain new rows or even modifications of the existing rows.
For example, here I have modifications in the first row and a new row:
name weight height smoke_or_not
0 John 84 183 yes
1 Shail 79 186 yes
2 Seema 67 158 no
3 Nidhi 52 155 no
4 Mary 70 162 no
QUESTION
How could I identify the changes in existing rows and the new rows to run INSERT and UPDATE queries on SQLAlchemy?
I have my df
coming from the file that is being updated daily.
I also have a df2
that is coming from the SQL database with a SELECT * from table query.
WHAT I HAVE TRIED
I have tried using:
changes = pd.merge(df, df2, how='outer', indicator=True)
right_only = changes[changes['_merge'] == 'right_only']
right_only
name weight height smoke_or_not _merge
4 John 84 183 yes right_only
5 Mary 70 162 no right_only
With this I can get a new dataframe with all the rows that were changed + the new row. However, there are no distinction about which rows were changed and which ones are the new ones. So I can't separate which rows I am going to UPDATE or INSERT. Is there any easier way to do that?