
I have a database table into which an external process automatically inserts rows every so often. Because of the nature of the data, I need to be vigilant about detecting possible 'duplicates' (really only similar rows, since only certain columns matter) and removing them. My plan is to use a database query to pull all similar rows into a dataframe, then sort it and build a subset of 'keep' rows. The idea is to left-join the original dataframe to the 'keep' dataframe and use a boolean test on the merge indicator to identify the rows that need to be deleted, per pandas get rows which are NOT in other dataframe. Can you tell me if I'm on the right track? I want to be very cautious about logic that deletes records from my database.

Original Dataframe looks like this:

   ID Account Type  Date       RowID
0  12  GOB     H    11/12/18   Az123
1  12  GOB     H    11/12/18   Az125
2  12  JPG     H    11/15/18   Az175
3  12  JPG     H    11/17/18   Az189
4  15  BLU     H    11/1/18    Ax127
5  15  BLU     D    11/18/18   Ax135
6  15  BLU     H    11/8/18    Ax175

It's important to keep a single record for each ID/Account combination, with preference to accounts with Type D first, and then by Oldest Date. Below is the desired keep subset.

Desired Keep Subset

   ID Account Type  Date       RowID
0  15  BLU     D    11/18/18   Ax135
1  12  GOB     H    11/12/18   Az123
2  12  JPG     H    11/15/18   Az175

Code: Edited with the help of W-B

df = pd.read_sql(similar_rows_sql, connection)
df['Date'] = pd.to_datetime(df['Date'])  # sort as dates, not strings
df['helpkey'] = df.Type.eq('D')          # True when Type is 'D'
# Sort once on both keys: Type D rows first, then oldest Date first.
# A single sort_values call is safer than chaining two sorts, because the
# second sort is not guaranteed to preserve the order of the first.
keep_df = df.sort_values(['helpkey', 'Date'], ascending=[False, True]) \
            .drop_duplicates(['ID', 'Account'], keep='first')
df_all = df.merge(keep_df, how='left', indicator=True)
df_remove = df_all.loc[df_all['_merge'] == 'left_only']
for row_id in df_remove['RowID']:
    cursor.execute(remove_duplicate_sql, (row_id,))
connection.commit()
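As a sanity check, the keep/remove split can be reproduced end to end on an in-memory copy of the sample rows from the question (no database involved; the sort is collapsed into a single `sort_values` call so the D-first and oldest-date keys are applied together):

```python
import pandas as pd

# Sample rows from the question
df = pd.DataFrame({
    'ID':      [12, 12, 12, 12, 15, 15, 15],
    'Account': ['GOB', 'GOB', 'JPG', 'JPG', 'BLU', 'BLU', 'BLU'],
    'Type':    ['H', 'H', 'H', 'H', 'H', 'D', 'H'],
    'Date':    ['11/12/18', '11/12/18', '11/15/18', '11/17/18',
                '11/1/18', '11/18/18', '11/8/18'],
    'RowID':   ['Az123', 'Az125', 'Az175', 'Az189', 'Ax127', 'Ax135', 'Ax175'],
})

df['Date'] = pd.to_datetime(df['Date'], format='%m/%d/%y')
df['helpkey'] = df.Type.eq('D')

# Type D first, then oldest Date; keep the first row per ID/Account.
# A multi-column sort_values is stable, so ties (Az123/Az125) keep their
# original order and the first-inserted row wins.
keep_df = (df.sort_values(['helpkey', 'Date'], ascending=[False, True])
             .drop_duplicates(['ID', 'Account'], keep='first'))

# Anti-join: rows present in df but absent from keep_df are the deletions
df_all = df.merge(keep_df, how='left', indicator=True)
df_remove = df_all.loc[df_all['_merge'] == 'left_only']

print(sorted(keep_df['RowID']))    # ['Ax135', 'Az123', 'Az175']
print(sorted(df_remove['RowID']))  # ['Ax127', 'Ax175', 'Az125', 'Az189']
```

The keep list matches the desired subset in the question, and every other RowID lands in `df_remove`, so nothing is deleted that isn't an identified duplicate.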

Edit: Type concern removed, thanks to W-B.

My only remaining concern is whether this logic is pythonic and accurate to my intentions. Can anyone put my concerns to rest that it is?

  • I've been using this version for a few days now and I think I'm comfortable with it. I will mark W-B as answer because he helped me solve the sorting issue which was the only known issue I had. – Drew Aschenbrener Nov 14 '18 at 21:37

1 Answer


Using a helpkey

df['helpkey'] = df.Type.eq('D')  # True when Type is 'D', so sorting on helpkey puts the D row last

df.Date=pd.to_datetime(df.Date)
df.sort_values(['ID','helpkey','Date']).drop_duplicates(['ID','Account'],keep='last')
Out[163]: 
   ID Account Type       Date  RowID  helpkey
1  12     GOB    H 2018-11-12  Az125    False
3  12     JPG    H 2018-11-17  Az189    False
5  15     BLU    D 2018-11-18  Ax135     True
BENY
  • Thank you! The use of the help key column does help me ensure I'm getting the right duplicates. I had to modify your example slightly to actually get the Desired 'Keep' Subset, but it is definitely the right approach. `keep_df = df.sort_values(['Date']).sort_values(['helpkey'], ascending=False).drop_duplicates(['ID','Account'], keep='first')` – Drew Aschenbrener Nov 08 '18 at 16:59