Applying filter functions on a dataframe

Question

I have a dataset of sales and purchases on a market place, looking a little like this.

User_ID | Transaction_Type |   Date   | Amount
    1   |      Sale        | 01/01/14 | 200.00
    2   |     Purchase     | 01/01/14 |  30.00
    ...

I need to filter out customers who have only bought or sold something versus customers who have bought and sold something at least once.

I am trying to create a function that will check if a user has done both or not. If a user has done both, then the user will be marked as a yes, otherwise no.

So far I have tried doing this,

def user_filter(df):
if df in df['User_ID'].filter(lambda x : ((x['Transaction_Type']=='Sale').any())&((x['Transaction_Type']=='Purchase').any())):
    return 'yes'
else:
    return 'no'
df['cross'] = df['User_ID'].apply(user_filter)

Let's assume later on in the dataset that User_ID 1 will come back as a Purchase. I would hope it would return as :

User_ID | Transaction_Type |   Date   | Amount | cross
    1   |      Sale        | 01/01/14 | 200.00 |  yes
    2   |     Purchase     | 01/01/14 |  30.00 |   no

but the following error returns:

'int' object is not subscriptable

When i apply it to the whole dataframe as opposed to just the series, it returns:

KeyError: ('User_ID', 'occurred at index User_ID')

Can you explain the logic for `cross`? IIUC, `cross`=`yes` requires BOTH `Sale` and `Purchase` for the **same** `User_ID`. Correct? Your expected output, you show `cross=yes` for **only** `Sale`. This confuses me. — edesz, Apr 07 '19 at 19:37
You are applying to one series - User_Id try for the whole df — Konrad, Apr 07 '19 at 19:39
@KonradSitarz That returns `KeyError: ('User_ID', 'occurred at index User_ID')` — Kbbm, Apr 07 '19 at 19:40
@edesz the following `df['User_ID'].filter(lambda x : ((x['Transaction_Type']=='Sale').any())&((x['Transaction_Type']=='Purchase').any()))` will filter only users who are both. — Kbbm, Apr 07 '19 at 19:43
@KyleMcComb: `User_ID=1` does not have a `Sale`. Yet this user is assigned `cross=yes`. According to the verbal instructions you have posted, I would have thought they should be assigned `cross=no` and `cross=yes` should only be assigned if the **same** `User_ID` has `Sale` and `Purchase`. Please clarify — edesz, Apr 07 '19 at 19:53
@edesz There are about 1000 + more observations in my dataframe and user 1 appears again as a sale. I just truncated it. apologies for not making that clear. — Kbbm, Apr 07 '19 at 20:01

edesz · Accepted Answer · 2019-04-07T21:03:01.083

One possible approach to doing this is using groupby and then, instead of aggregating, simply list the Transaction_Types in each group like show in this SO post. Then, just get the length of the list....if the length is 2 this means both Sale and Purchase are present for that user. On the other hand, if the length is 1, then only one of Sale or Purchase is present for that user.

Generate some data per the OP (I added a third record to make the output more explicit)

d = [['User_ID', 'Transaction_Type', 'Date', 'Amount'],
    [1, 'Sale', '01/01/14', 200],
    [1, 'Purchase','01/02/14',300],
    [2, 'Purchase','01/01/14',30],]

Perform the GROUP BY

df_users = df.groupby('User_ID')['Transaction_Type'].apply(list).reset_index(drop=False)
df_users.rename(columns={'Transaction_Type':'Transactions'}, inplace=True)

print(df_users)
   User_ID      Transactions
0        1  [Sale, Purchase]
1        2        [Purchase]

Now append a cross column to the grouped DataFrame and populate the cross column as required

df_users['cross'] = 'no'
df_users.loc[df_users.Transactions.apply(len)==2, 'cross'] = 'yes'

print(df_users)
   User_ID      Transactions cross
0        1  [Sale, Purchase]   yes
1        2        [Purchase]    no

EDIT 1

Alternatively, drop the apply steps and just use size

df_users = df.groupby('User_ID')['Transaction_Type'].size().reset_index(drop=False)
df_users['cross'] = 'no'
df_users.loc[df_users.Transactions==2, 'cross'] = 'yes'

print(df_users)
   User_ID  Transactions cross
0        1             2   yes
1        2             1    no

EDIT 2

If you wanted to append the cross column to the source DataFrame, then add these 2 lines of code to the above

df = df.merge(df_users, on='User_ID')
df.drop(columns=['Transactions'], inplace=True)

print(df)
   User_ID Transaction_Type      Date  Amount cross
0        1             Sale  01/01/14     200   yes
1        1         Purchase  01/02/14     300   yes
2        2         Purchase  01/01/14      30    no

Applying filter functions on a dataframe

1 Answers1