Pandas merge on a customerEmail column that has duplicates

Question

The aim is to detect fraud from this dataset.

I have two dataframes with columns as:

DF1 [customerEmail, customerphone, customerdevice, customeripadd, NoOftransactions, Fraud], etc., shape (168, 11)

DF2 [customerEmail, transactionid, payment methods, orderstatus], etc., shape (623, 11)

The customerEmail column is common to both dataframes, so it makes sense to merge the tables on customerEmail.

The problem is that customerEmail values repeat in DF2, and some of them have no matching row in DF1. So when I merge using:

DF3 = pd.merge(DF1, DF2, on='customerEmail')

the result has shape (819, 18), with repeated email IDs carrying misleading data.

I want the merge to match on customerEmail from DF1, so my final dataframe DF3 should have roughly as many rows as DF1.
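A minimal sketch of why the row count grows, using made-up toy data rather than the real Kaggle files. pd.merge returns one row for every matching pair of keys, so an email that appears m times in DF1 and n times in DF2 produces m * n rows. Note that 819 rows is more than the 623 rows of DF2; with the default inner merge that can only happen if some customerEmail values are repeated in DF1 as well, which is worth checking with duplicated().

```python
import pandas as pd

# Toy stand-ins for DF1 (customers) and DF2 (transactions); all values are made up.
DF1 = pd.DataFrame({
    "customerEmail": ["a@x.com", "b@x.com", "b@x.com"],  # b@x.com appears twice
    "Fraud": [False, True, True],
})
DF2 = pd.DataFrame({
    "customerEmail": ["a@x.com", "b@x.com", "b@x.com", "z@x.com"],  # z@x.com is not in DF1
    "transactionid": ["t1", "t2", "t3", "t4"],
})

# Default how='inner': a@x.com gives 1 * 1 row, b@x.com gives 2 * 2 rows, z@x.com is dropped.
DF3 = pd.merge(DF1, DF2, on="customerEmail")

# Count repeated keys on each side to see where the multiplication comes from.
dup_in_df1 = int(DF1["customerEmail"].duplicated().sum())
dup_in_df2 = int(DF2["customerEmail"].duplicated().sum())

print(DF3.shape, dup_in_df1, dup_in_df2)
```

If dup_in_df1 is non-zero on the real data, DF1 would need deduplicating before any merge can return one row per customer.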

Could you provide some reproducible examples of your dataframes? — user32882, Aug 27 '20 at 14:22
I'm actually new to this site so I don't know how to ask questions with examples yet. I am sharing the kaggle data link for you to look at: https://www.kaggle.com/aryanrastogi7767/ecommerce-fraud-data — Yilmaz, Aug 27 '20 at 14:27

score 0 · Answer 1 · answered Aug 27 '20 at 14:29


Try changing the how parameter to 'left'.

For example:

DF3 = DF1.merge(DF2, how='left', on='customerEmail')

Failing this, we probably need some more information.
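One caveat, sketched below on made-up toy data: how='left' keeps every row of DF1, but it still returns one row per matching transaction, so customers with several rows in DF2 are repeated. If the goal is one row per customer, one possible approach is to aggregate DF2 per customerEmail first and then merge. The summary columns n_transactions and n_failed are hypothetical names chosen for this example, and the sketch assumes customerEmail is unique in DF1.

```python
import pandas as pd

# Toy data; values are invented for illustration only.
DF1 = pd.DataFrame({
    "customerEmail": ["a@x.com", "b@x.com", "c@x.com"],
    "Fraud": [False, True, False],
})
DF2 = pd.DataFrame({
    "customerEmail": ["a@x.com", "b@x.com", "b@x.com", "z@x.com"],
    "transactionid": ["t1", "t2", "t3", "t4"],
    "orderstatus": ["fulfilled", "failed", "fulfilled", "pending"],
})

# Plain left merge: all DF1 customers are kept, but b@x.com appears twice.
left = DF1.merge(DF2, how="left", on="customerEmail")

# Collapse DF2 to one row per customer before merging.
# n_transactions and n_failed are hypothetical summary features.
per_customer = (
    DF2.groupby("customerEmail")
       .agg(
           n_transactions=("transactionid", "count"),
           n_failed=("orderstatus", lambda s: int((s == "failed").sum())),
       )
       .reset_index()
)

# validate raises MergeError if either side has repeated keys.
DF3 = DF1.merge(per_customer, how="left", on="customerEmail", validate="one_to_one")

print(len(left), len(DF3))
```

Customers with no transactions (c@x.com here) end up with NaN in the summary columns, which can be filled with 0 if that suits the analysis.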

answered Aug 27 '20 at 14:29

GhandiFloss

I have actually tried both left and right but I'm not sure if it's doing it correctly. I have updated the dataset link for you to take a look at. – Yilmaz Aug 27 '20 at 14:36

score 0 · Answer 2 · answered Aug 27 '20 at 14:30


Maybe you should consider a different value for the "how" option. The default is "inner", which drops all rows that have no match in the other dataframe.

Maybe "right" would help: DF2 then becomes the reference, and DF1 is joined onto DF2.
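To see what each value of "how" does in practice, here is a small sketch on invented toy data. It counts the rows produced by each join type and uses indicator=True, which adds a _merge column saying whether each row came from the left frame only, the right frame only, or both.

```python
import pandas as pd

# Toy data; values are invented for illustration only.
DF1 = pd.DataFrame({
    "customerEmail": ["a@x.com", "b@x.com", "c@x.com"],
    "Fraud": [False, True, False],
})
DF2 = pd.DataFrame({
    "customerEmail": ["a@x.com", "b@x.com", "b@x.com", "z@x.com"],
    "transactionid": ["t1", "t2", "t3", "t4"],
})

# Row count for each join type.
rows = {
    how: len(DF1.merge(DF2, how=how, on="customerEmail"))
    for how in ("inner", "left", "right", "outer")
}

# Outer merge with indicator=True labels where each row's key was found.
audit = DF1.merge(DF2, how="outer", on="customerEmail", indicator=True)
counts = audit["_merge"].value_counts()

print(rows)
print(counts)
```

Rows marked right_only are DF2 emails with no customer record in DF1, and left_only rows are customers with no transactions. Note that none of the join types removes the repetition caused by duplicate keys; they only decide which unmatched rows are kept.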

answered Aug 27 '20 at 14:30

Lazloo Xp
