I have this huge dataset (100M rows) of consumer transactions that looks as follows:
df = pd.DataFrame({'id':[1, 1, 2, 2, 3],'brand':['a','b','a','a','c'], 'date': ['01-01-2020', '01-02-2020', '01-05-2019', '01-06-2019', '01-12-2018']})
For each row (each transaction), I would like to check if the same person (same "id") bought something in the past for a different brand. The resulting dataset should look like this:
id brand date check
0 1 a 01-01-2020 0
1 1 b 01-02-2020 1
2 2 a 01-05-2019 0
3 2 a 01-06-2019 0
4 3 c 01-12-2018 0
Now, my solution was:
def past_transaction(row):
x = df[(df['id'] == row['id']) & (df['brand'] != row['brand']) & (df['date'] < row['date'])]
if x.shape[0]>0:
return 1
else:
return 0
df['check'] = df.appy(past_transaction, axis=1)
This works well, but the performance is abysmal. Is there a more efficient way to do this (with or without Pandas)? Thanks!