
The aim is to detect fraud from this dataset.

I have two dataframes with the following columns:

DF1[customerEmail, customerphone, customerdevice, customeripadd, NoOftransactions, Fraud], etc., shape (168, 11)

DF2[customerEmail, transactionid, payment methods, orderstatus], etc., shape (623, 11)

The customerEmail column is common to both dataframes, so it makes sense to merge the tables on customerEmail.

The problem is that DF2 contains repeated customerEmail values that have no reference in DF1. So when I merge using:

DF3 = pd.merge(DF1, DF2, on='customerEmail')

the result has shape (819, 18), with repeated email IDs carrying misleading data.

I want the merge to match on customerEmail from DF1, so my final dataframe DF3 should have roughly the same number of rows as DF1.

Here's a link to the data for you to look at. Cheers https://www.kaggle.com/aryanrastogi7767/ecommerce-fraud-data

Yilmaz
  • Could you provide some reproducible examples of your dataframes? – user32882 Aug 27 '20 at 14:22
  • I'm actually new to this site so I don't know how to ask questions with examples yet. I am sharing the kaggle data link for you to look at: https://www.kaggle.com/aryanrastogi7767/ecommerce-fraud-data – Yilmaz Aug 27 '20 at 14:27

2 Answers


Try changing the how parameter to 'left'.

For example:

DF3 = DF1.merge(DF2, how='left', on='customerEmail')

Failing this, we probably need some more information.
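One guess in the meantime: a left merge keeps every row of DF1, but a customer with several transactions in DF2 still appears once per transaction, so DF3 can end up longer than DF1. If one row per customer is what you want, a possible approach is to summarise DF2 per customerEmail first and merge the summary. The sketch below runs on invented toy data; the names n_transactions and n_failed and the "failed" status value are my assumptions, not columns or values taken from your dataset.

```python
import pandas as pd

# Toy stand-ins for DF1 and DF2; all values are invented for illustration.
DF1 = pd.DataFrame({
    "customerEmail": ["a@x.com", "b@x.com", "c@x.com"],
    "Fraud": [False, True, False],
})
DF2 = pd.DataFrame({
    "customerEmail": ["a@x.com", "a@x.com", "b@x.com", "z@x.com"],
    "transactionid": ["t1", "t2", "t3", "t4"],
    "orderstatus": ["fulfilled", "failed", "fulfilled", "failed"],
})

# Collapse DF2 to one row per customer (hypothetical summary columns).
per_customer = (
    DF2.groupby("customerEmail")
    .agg(
        n_transactions=("transactionid", "count"),
        n_failed=("orderstatus", lambda s: (s == "failed").sum()),
    )
    .reset_index()
)

# Left merge keeps every DF1 row. validate="many_to_one" raises a MergeError
# if customerEmail is not unique in per_customer, so rows cannot multiply.
DF3 = DF1.merge(per_customer, how="left", on="customerEmail",
                validate="many_to_one")

print(DF3.shape)  # same number of rows as DF1
```

Customers in DF1 with no transactions in DF2 (c@x.com here) get NaN in the summary columns, and emails that exist only in DF2 (z@x.com) are dropped.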

GhandiFloss
  • I have actually tried both left and right but I'm not sure if it's doing it correctly. I have updated the dataset link for you to take a look at. – Yilmaz Aug 27 '20 at 14:36

Maybe you should consider a different value for the "how" option. By default it is "inner", which drops every row that has no match in the other dataframe.

Maybe "right" would help: DF2 then becomes the reference and DF1 is joined onto it.
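To see what each value of "how" does to the row count, here is a small sketch on invented toy data (the emails and values are made up, not taken from the Kaggle dataset). It only illustrates the join types; it is not a claim about what your real data contains.

```python
import pandas as pd

# Toy stand-ins for DF1 and DF2; all values are invented for illustration.
DF1 = pd.DataFrame({
    "customerEmail": ["a@x.com", "b@x.com", "c@x.com"],
    "Fraud": [False, True, False],
})
DF2 = pd.DataFrame({
    "customerEmail": ["a@x.com", "a@x.com", "b@x.com", "z@x.com"],
    "transactionid": ["t1", "t2", "t3", "t4"],
})

# Merge with each join type and record the resulting shape.
shapes = {}
for how in ["inner", "left", "right", "outer"]:
    merged = pd.merge(DF1, DF2, how=how, on="customerEmail")
    shapes[how] = merged.shape
    print(how, merged.shape)

# inner (3, 3)  only emails present in both frames
# left  (4, 3)  every DF1 row, plus the repeat for a@x.com
# right (4, 3)  every DF2 row, including the unmatched z@x.com
# outer (5, 3)  everything from both sides
```

Note that whenever an email repeats in DF2, the matching DF1 row is repeated once per transaction regardless of the join type, so none of these options alone gives one row per customer.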

Lazloo Xp