
I have this huge dataset (100M rows) of consumer transactions that looks as follows:

import pandas as pd

df = pd.DataFrame({'id': [1, 1, 2, 2, 3],
                   'brand': ['a', 'b', 'a', 'a', 'c'],
                   'date': ['01-01-2020', '01-02-2020', '01-05-2019', '01-06-2019', '01-12-2018']})

For each row (each transaction), I would like to check if the same person (same "id") bought something in the past for a different brand. The resulting dataset should look like this:

    id  brand  date        check
0   1   a      01-01-2020  0  
1   1   b      01-02-2020  1
2   2   a      01-05-2019  0
3   2   a      01-06-2019  0
4   3   c      01-12-2018  0

Now, my solution was:

def past_transaction(row):
    x = df[(df['id'] == row['id']) & (df['brand'] != row['brand']) & (df['date'] < row['date'])]
    if x.shape[0]>0:
        return 1
    else:
        return 0

df['check'] = df.apply(past_transaction, axis=1)

This works well, but the performance is abysmal. Is there a more efficient way to do this (with or without Pandas)? Thanks!

user2447387
First, you are using `apply` when there are core functions that will make the performance 100 times faster than what you have right now. Second, pandas is not meant for large-data workflows. Put it in a SQL db and throw an index on your key columns, or use something like `spark` or `dask`. – Umar.H Nov 27 '20 at 22:51
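
For illustration only, here is a rough sketch of the comment's SQL-plus-index idea using the standard library's sqlite3 (the file, table, and index names are made up, and the dates would need to be stored in a sortable format such as YYYY-MM-DD for the string comparison to be correct):

import sqlite3
import pandas as pd

con = sqlite3.connect('transactions.db')               # illustrative file name
df.to_sql('transactions', con, if_exists='replace', index=False)
con.execute('CREATE INDEX idx_id_date ON transactions (id, date)')

# for every transaction, flag whether an earlier one exists for the same id but a different brand
query = """
SELECT t.id, t.brand, t.date,
       EXISTS (SELECT 1 FROM transactions p
               WHERE p.id = t.id AND p.brand <> t.brand AND p.date < t.date) AS "check"
FROM transactions t
"""
out = pd.read_sql(query, con)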

2 Answers


I would personally use two boolean masks.

First, check whether the id is duplicated (keeping the first occurrence). Second, check that the id & brand combination is not duplicated. Rows where both conditions hold get a 1:

import numpy as np

# rows whose id has already appeared earlier in the frame
s = df.duplicated(subset=['id'], keep='first')
# rows whose (id, brand) combination occurs only once in the frame
s1 = ~df.duplicated(subset=['id', 'brand'], keep=False)
df['check'] = np.where(s & s1, 1, 0)


   id brand        date  check
0   1     a  01-01-2020      0
1   1     b  01-02-2020      1
2   2     a  01-05-2019      0
3   2     a  01-06-2019      0
4   3     c  01-12-2018      0
Umar.H

A) Use Pandas builtin functions

The first step would be to use pandas' built-in (vectorized) functions instead of applying your own function row by row:

import numpy as np

# vectorized version of the check; cast the boolean result to 0/1 as in the desired output
df['check'] = np.logical_and(df.id.duplicated(), ~df[['id', 'brand']].duplicated()).astype(int)

This alone will already make your code faster!

B) Take advantage of hardware

Opt in to using all the cores on your machine, if your RAM permits. You can use modin.pandas or a similar alternative for that. I recommend this because it requires minimal changes and can provide a substantial speed-up depending on your machine's configuration.
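
As a hedged sketch only (assuming modin and one of its engines, such as Ray or Dask, are installed), the change is essentially a swapped import, since modin mirrors the pandas API:

import modin.pandas as pd    # drop-in replacement for `import pandas as pd`

df = pd.read_csv('transactions.csv')   # illustrative file name, not from the question

# same vectorized check as in A), now spread across all available cores
df['check'] = (df['id'].duplicated() & ~df[['id', 'brand']].duplicated()).astype(int)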

C) Big Data Frameworks

If it is a big-data problem, you should already be using dask or spark dataframes, which are built to handle such volumes; pandas isn't meant to handle data this large.
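
As an illustrative sketch only (not from the answer itself; the file paths are made up and the per-id helper is a plain pandas function applied per group), the same check could look like this in dask:

import dask.dataframe as dd
import pandas as pd

def flag_prior_other_brand(group: pd.DataFrame) -> pd.DataFrame:
    # within one id, flag rows preceded by a purchase of a different brand
    group = group.sort_values('date')
    seen = set()
    flags = []
    for brand in group['brand']:
        flags.append(1 if (seen - {brand}) else 0)
        seen.add(brand)
    group['check'] = flags
    return group

ddf = dd.read_csv('transactions-*.csv', parse_dates=['date'])   # illustrative paths
result = ddf.groupby('id').apply(
    flag_prior_other_brand,
    meta={'id': 'int64', 'brand': 'object', 'date': 'datetime64[ns]', 'check': 'int64'},  # assumed dtypes
)
# result is still lazy; write it out (or call .compute() on a sample) to materialize it
result.to_parquet('transactions_checked/')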

These are some things I found effective while dealing with a similar problem.

Hamza