
I have a set of financial transactions with a Date, Amount, Description and Source, and I want to find transactions where the amount is the same, the date is within one day, and the sources are different. The sources must differ because the transactions are imported from many sources, and each source has unique entries (a source never duplicates itself).

For example, I would want to find that rows 1 and 3 are duplicates:

'date','amount','description','source'
1/5/2018, 5.28, 'McDonalds', 'BankOfAmerica'
1/6/2018, 8.44, 'Starbucks', 'BankOfAmerica'
1/5/2018, 5.28, 'McDonalds Rest', 'BoA'
2/10/2018, 22.72, 'Chipolte', 'Chase'
3/10/2018, 4.58, 'Wendys', 'BoA'

I tried this in Python; I can find exact duplicates with:

df_no_dups = df.drop_duplicates(subset=['amount','date'])
df_dups = df[~df.isin(df_no_dups)].dropna()

but this requires an exact date match, and I then have to run another script to make sure the sources were different.

I also tried grouping by amount and then iterating within each group to find rows where the dates are close and the sources are different, but I couldn't work out the details of iterating over the groups.
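Roughly the shape of what I was attempting (a sketch, assuming the dates are parsed with pd.to_datetime; the frame just mirrors the sample table and all names are illustrative): group on amount, then pairwise-compare dates and sources inside each group.

import pandas as pd

# Sample frame mirroring the table above.
df = pd.DataFrame({
    'date': ['1/5/2018', '1/6/2018', '1/5/2018', '2/10/2018', '3/10/2018'],
    'amount': [5.28, 8.44, 5.28, 22.72, 4.58],
    'description': ['McDonalds', 'Starbucks', 'McDonalds Rest', 'Chipolte', 'Wendys'],
    'source': ['BankOfAmerica', 'BankOfAmerica', 'BoA', 'Chase', 'BoA'],
})
df['date'] = pd.to_datetime(df['date'], format='%m/%d/%Y')

dup_index = set()
for _, grp in df.groupby('amount'):
    # Pairwise check inside each amount group (fine for modest data sizes):
    # dates at most one day apart, sources different.
    for i, a in grp.iterrows():
        for j, b in grp.iterrows():
            if i < j and a['source'] != b['source'] \
                    and abs((a['date'] - b['date']).days) <= 1:
                dup_index.update([i, j])

print(df.loc[sorted(dup_index)])  # rows 0 and 2 for this sample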

Other approaches could use SQL or Google Sheets, since that's where the transactions live.

bonhoffer

2 Answers


Consider the following data (I added a row at index 3 for better understanding):

import io
import pandas as pd

data = io.StringIO("""5 Jan, 5.28, 'McDonalds', 'BankOfAmerica'
6 Jan, 8.44, 'Starbucks', 'BankOfAmerica'
5 Jan, 5.28, 'McDonalds Rest', 'BoA'
5 Jan, 5.28, 'McDonalds Rest', 'BankOfAmerica'
10 Feb, 22.72, 'Chipolte', 'Chase'""")
df = pd.read_csv(data, header=None, skipinitialspace=True)
df.columns = ['Date', 'Amount', 'Description', 'Source']
print(df)

     Date  Amount       Description           Source
0   5 Jan    5.28       'McDonalds'  'BankOfAmerica'
1   6 Jan    8.44       'Starbucks'  'BankOfAmerica'
2   5 Jan    5.28  'McDonalds Rest'            'BoA'
3   5 Jan    5.28  'McDonalds Rest'  'BankOfAmerica'
4  10 Feb   22.72        'Chipolte'          'Chase'

For duplicates with different sources:

# Keep every row whose (Date, Amount) pair occurs more than once.
df_dups = df[df.duplicated(['Date','Amount'], keep=False)]
# Collapse same-source repeats so each source contributes one row per (Date, Amount).
df_dups = df_dups.drop_duplicates(['Date','Amount','Source'])
print(df_dups)


    Date  Amount       Description           Source
0  5 Jan    5.28       'McDonalds'  'BankOfAmerica'
2  5 Jan    5.28  'McDonalds Rest'            'BoA'

For the non-duplicates (pulling all the other rows, essentially df minus df_dups):

no_dups = df.loc[~df.index.isin(df_dups.index)]
print(no_dups)

     Date  Amount       Description           Source
1   6 Jan    8.44       'Starbucks'  'BankOfAmerica'
3   5 Jan    5.28  'McDonalds Rest'  'BankOfAmerica'
4  10 Feb   22.72        'Chipolte'          'Chase'
anky
  • any ideas for when the dates differ by a day? – bonhoffer Mar 17 '19 at 11:41
  • I mean, if the dates are 5 Jan and 6 Jan but the sources are different and the amounts are the same, then it is still a duplicate, because the financial records are sometimes off by a day. – bonhoffer Mar 17 '19 at 11:48
  • wait! this doesn't work. the second step just removes all those with duplicate sources. – bonhoffer Mar 17 '19 at 12:02
  • @bonhoffer my bad, I missed the date while dropping dups: use `df_dups = df_dups.drop_duplicates(['Date','Source'])`; edited the answer – anky Mar 17 '19 at 12:09
  • So the approach is going to remove lots of duplicates just because they share a common source. – bonhoffer Mar 17 '19 at 12:10
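To address the one-day window raised in the comments: one possible variant (a sketch, not part of the answer above) is a self-join on Amount followed by a date-difference filter. It assumes Date has been parsed to datetimes; column names follow the answer's frame.

import io
import pandas as pd

# Rebuild the answer's frame with parsed dates.
data = io.StringIO("""5 Jan,5.28,McDonalds,BankOfAmerica
6 Jan,8.44,Starbucks,BankOfAmerica
5 Jan,5.28,McDonalds Rest,BoA
5 Jan,5.28,McDonalds Rest,BankOfAmerica
10 Feb,22.72,Chipolte,Chase""")
df = pd.read_csv(data, header=None, names=['Date', 'Amount', 'Description', 'Source'])
df['Date'] = pd.to_datetime(df['Date'], format='%d %b')

# Self-join on Amount, then keep pairs whose dates are at most one day
# apart and whose sources differ; the index columns identify each pair.
left = df.reset_index()
pairs = left.merge(left, on='Amount', suffixes=('_a', '_b'))
mask = (
    (pairs['index_a'] < pairs['index_b'])           # count each pair once
    & (pairs['Source_a'] != pairs['Source_b'])      # different imports
    & (pairs['Date_a'] - pairs['Date_b']).abs().le(pd.Timedelta(days=1))
)
dup_idx = sorted(set(pairs.loc[mask, 'index_a']) | set(pairs.loc[mask, 'index_b']))
print(df.loc[dup_idx])  # rows 0, 2 and 3 for this sample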

Use exists:

select t1.*
from table_name t1
where exists (select 1
              from table_name t2
              where t2.date = t1.date
                and t2.amount = t1.amount
                and t1.source <> t2.source)
Zaynul Abadin Tuhin
  • wow -- this is pretty straightforward -- thanks -- and I could use a datediff function or something on that first `=` if I wanted the dates to differ by a day – bonhoffer Mar 17 '19 at 11:40
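For completeness, a rough demo of that idea with a one-day window, using SQLite from Python (julianday turns a date into a day number, so subtracting gives a difference in days; the table name and ISO date format are assumptions, not from the answer):

import sqlite3

# In-memory demo table mirroring the question's sample data (ISO dates assumed).
con = sqlite3.connect(':memory:')
con.execute("create table transactions (date text, amount real, description text, source text)")
con.executemany(
    "insert into transactions values (?, ?, ?, ?)",
    [('2018-01-05', 5.28, 'McDonalds', 'BankOfAmerica'),
     ('2018-01-06', 8.44, 'Starbucks', 'BankOfAmerica'),
     ('2018-01-05', 5.28, 'McDonalds Rest', 'BoA'),
     ('2018-02-10', 22.72, 'Chipolte', 'Chase'),
     ('2018-03-10', 4.58, 'Wendys', 'BoA')])

# Same exists-query, but matching dates within one day instead of exactly.
rows = con.execute("""
    select t1.* from transactions t1
    where exists (select 1 from transactions t2
                  where abs(julianday(t1.date) - julianday(t2.date)) <= 1
                    and t2.amount = t1.amount
                    and t1.source <> t2.source)
""").fetchall()
print(rows)  # the two McDonalds rows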