I have a dataframe with a sample attached. I want to give loyalty labels for all the users in my dataframe.
The labels and their conditions are:
first_time_visitor
- any user with session = 0 (see the first line in loyalty column)frequent_visitor
- any user with session > 0 and count_session_products > 0first_time_customer
- first time checkout:confirmation appears in the type columnrepeat_customer
- second time of checkout:confirmation appears in the type columnloyal_customer
- third time of checkout:confirmation appears in the type column
I am having trouble calculating first_time_customer
, repeat_customer
and loyal_customer
labels.
I have had a couple of ideas only, one being to use first_valid_index()
or argmax()
to find the index and somehow use that in a condition to create the first_time_customer
label.
(df_customer_1.type.values == 'checkout:confirmation').argmax()
df_customer_1[df_customer_1.type == 'checkout:confirmation'].first_valid_index()
But I am not sure how to implement these conditions.
In the end I would like the loyalty column in the picture to look like:
first_time_visitor
frequent_visitor
first_time_customer
repeat_customer
frequent_visitor
loyal_customer
Can anyone guide me as to how I can achieve this? I do not have any code to show for it so far therefore you would probably have to come up with something. We can try both pandas or a typical pythonic way.
This is what I have so far, sample from the data:
data = {
'user_id': [
'9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX',
'9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX'
],
'timestamp': [
1612139269, 1612139665, 1612139579,
1612141096, 1612143046, 1612143729
],
'type': ['productDetails', 'productDetails', 'checkout:confirmation', 'checkout:confirmation', 'productList', 'checkout:confirmation'],
'session': [0,1,2,3,4,5],
'count_session_products': [4, 1, 0, 4, 2, 2],
'loyalty' : [0,0,0,0,0,0]}
test_df = pd.DataFrame(data)
test_df
which gives me:
user_id | timestamp | type | session | prods | loyalty |
---|---|---|---|---|---|
9EPWZVMNP6D6KWX | 1612139269 | productDetails | 0 | 4 | 0 |
9EPWZVMNP6D6KWX | 1612139665 | productDetails | 1 | 1 | 0 |
9EPWZVMNP6D6KWX | 1612139579 | checkout:confirmation | 2 | 0 | 0 |
9EPWZVMNP6D6KWX | 1612141096 | checkout:confirmation | 3 | 4 | 0 |
9EPWZVMNP6D6KWX | 1612143046 | productList | 4 | 2 | 0 |
9EPWZVMNP6D6KWX | 1612143729 | checkout:confirmation | 5 | 2 | 0 |
First condition for frequent_visiotr and first_time_visitor:
test_df['loyalty'] = np.where((test_df['session'] > 0) & ((test_df['type'] != 'checkout:confirmation')), 'frequent_visitor', None)
test_df.loc[test_df['session'] == 0, 'loyalty'] = 'first_time_visitor'
which gives me:
user_id | timestamp | type | session | prods | loyalty |
---|---|---|---|---|---|
9EPWZVMNP6D6KWX | 1612139269 | productDetails | 0 | 4 | first_time_visitor |
9EPWZVMNP6D6KWX | 1612139665 | productDetails | 1 | 1 | frequent_visitor |
9EPWZVMNP6D6KWX | 1612139579 | checkout:confirmation | 2 | 0 | 0 |
9EPWZVMNP6D6KWX | 1612141096 | checkout:confirmation | 3 | 4 | 0 |
9EPWZVMNP6D6KWX | 1612143046 | productList | 4 | 2 | frequent_visitor |
9EPWZVMNP6D6KWX | 1612143729 | checkout:confirmation | 5 | 2 | 0 |
Now this is where I am stuck, I need conditions for first_time_customer, repeat_customer and loyal_customer. The rules for these conditions are described above. I would like the end dataframe to look like this:
user_id | timestamp | type | session | prods | loyalty |
---|---|---|---|---|---|
9EPWZVMNP6D6KWX | 1612139269 | productDetails | 0 | 4 | first_time_visitor |
9EPWZVMNP6D6KWX | 1612139665 | productDetails | 1 | 1 | frequent_visitor |
9EPWZVMNP6D6KWX | 1612139579 | checkout:confirmation | 2 | 0 | first_time_customer |
9EPWZVMNP6D6KWX | 1612141096 | checkout:confirmation | 3 | 4 | repeat_customer |
9EPWZVMNP6D6KWX | 1612143046 | productList | 4 | 2 | frequent_visitor |
9EPWZVMNP6D6KWX | 1612143729 | checkout:confirmation | 5 | 2 | loyal_customer |