I have a database table that an external process inserts rows into every so often. Because of the nature of the data, I need to be vigilant about detecting and removing any 'duplicates' (really just similar rows, since only certain columns matter). My plan is to query all similar rows into a dataframe, sort it, and build a subset dataframe of 'keep' rows. The idea is to left-join the original dataframe to the 'keep' dataframe and use a Boolean test to identify the rows that need to be deleted, per pandas get rows which are NOT in other dataframe. Can you tell me if I'm on the right track? I want to be very cautious about any logic that deletes records from my database.
Original Dataframe looks like this:
   ID  Account  Type  Date      RowID
0  12  GOB      H     11/12/18  Az123
1  12  GOB      H     11/12/18  Az125
2  12  JPG      H     11/15/18  Az175
3  12  JPG      H     11/17/18  Az189
4  15  BLU      H     11/1/18   Ax127
5  15  BLU      D     11/18/18  Ax135
6  15  BLU      H     11/8/18   Ax175
It's important to keep a single record for each ID/Account combination, preferring rows with Type D first, then the oldest Date. Below is the desired keep subset.
Desired Keep Subset
   ID  Account  Type  Date      RowID
0  15  BLU      D     11/18/18  Ax135
1  12  GOB      H     11/12/18  Az123
2  12  JPG      H     11/15/18  Az175
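To check the sort/dedupe/anti-join logic end to end, here is a self-contained sketch run against the sample table above (the `helpkey` trick and the `indicator=True` anti-join are the same as in the question's code; dates are parsed so "oldest" means chronological order rather than string order):

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    'ID':      [12, 12, 12, 12, 15, 15, 15],
    'Account': ['GOB', 'GOB', 'JPG', 'JPG', 'BLU', 'BLU', 'BLU'],
    'Type':    ['H', 'H', 'H', 'H', 'H', 'D', 'H'],
    'Date':    ['11/12/18', '11/12/18', '11/15/18', '11/17/18',
                '11/1/18', '11/18/18', '11/8/18'],
    'RowID':   ['Az123', 'Az125', 'Az175', 'Az189',
                'Ax127', 'Ax135', 'Ax175'],
})

# Parse dates so sorting is chronological, not lexicographic
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y')

# helpkey is True for Type 'D'; sort D-first, then oldest date,
# and keep the first row in each ID/Account group
df['helpkey'] = df['Type'].eq('D')
keep_df = (df.sort_values(['helpkey', 'Date'], ascending=[False, True])
             .drop_duplicates(['ID', 'Account'], keep='first'))

# Anti-join: rows present in df but not in keep_df
df_all = df.merge(keep_df, how='left', indicator=True)
df_remove = df_all.loc[df_all['_merge'] == 'left_only']

print(sorted(keep_df['RowID']))    # ['Ax135', 'Az123', 'Az175']
print(sorted(df_remove['RowID']))  # ['Ax127', 'Ax175', 'Az125', 'Az189']
```

Note that the two sort keys go in a single `sort_values` call: chaining two separate sorts is fragile because the default sort algorithm is not stable, so the second sort may scramble the order established by the first.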
Code (edited with the help of W-B):
df = pd.read_sql(similar_rows_sql, connection)
df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y')  # sort chronologically, not as strings
df['helpkey'] = df['Type'].eq('D')
keep_df = (df.sort_values(['helpkey', 'Date'], ascending=[False, True])
             .drop_duplicates(['ID', 'Account'], keep='first'))
df_all = df.merge(keep_df, how='left', indicator=True)
df_remove = df_all.loc[df_all['_merge'] == 'left_only']
for x in df_remove['RowID']:
    cursor.execute(remove_duplicate_sql, (x,))
connection.commit()
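For the delete step itself, the row-at-a-time loop can be collapsed into a single `executemany` call, with one commit at the end so the whole removal succeeds or fails as a unit. A minimal sketch against an in-memory SQLite database (the table name, column, and `?` placeholder style are assumptions for illustration; other drivers use `%s`):

```python
import sqlite3

# Stand-in table holding only the RowID column from the question
connection = sqlite3.connect(':memory:')
cursor = connection.cursor()
cursor.execute("CREATE TABLE accounts (RowID TEXT PRIMARY KEY)")
cursor.executemany("INSERT INTO accounts VALUES (?)",
                   [('Az123',), ('Az125',), ('Az189',)])

remove_duplicate_sql = "DELETE FROM accounts WHERE RowID = ?"
rows_to_remove = ['Az125', 'Az189']  # stand-in for df_remove['RowID']

# One batched call, one commit: the deletes are a single transaction
cursor.executemany(remove_duplicate_sql, [(r,) for r in rows_to_remove])
connection.commit()

cursor.execute("SELECT RowID FROM accounts")
print(cursor.fetchall())  # [('Az123',)]
```

Committing once after the loop (rather than per row) also means a crash mid-way leaves the table untouched instead of half-cleaned.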
Type concern removed, thanks to W-B.
My only remaining concern is whether this logic is Pythonic and accurate to my intentions. Can anyone put my concerns to rest?