0

I have a dataframe with a sample attached. I want to give loyalty labels for all the users in my dataframe.

The labels and their conditions are:

  • first_time_visitor - any user with session = 0 (see the first line in loyalty column)

  • frequent_visitor - any user with session > 0 and count_session_products > 0

  • first_time_customer - first time checkout:confirmation appears in the type column

  • repeat_customer - second time of checkout:confirmation appears in the type column

  • loyal_customer - third time of checkout:confirmation appears in the type column

I am having trouble calculating first_time_customer, repeat_customer and loyal_customer labels.

I have had a couple of ideas only, one being to use first_valid_index() or argmax() to find the index and somehow use that in a condition to create the first_time_customer label.

(df_customer_1.type.values == 'checkout:confirmation').argmax()
df_customer_1[df_customer_1.type == 'checkout:confirmation'].first_valid_index()

But I am not sure how to implement these conditions.

In the end I would like the loyalty column in the picture to look like:

first_time_visitor
frequent_visitor
first_time_customer
repeat_customer
frequent_visitor
loyal_customer

Can anyone guide me as to how I can achieve this? I do not have any code to show for it so far therefore you would probably have to come up with something. We can try both pandas or a typical pythonic way.

This is what I have so far, sample from the data:

data = {
'user_id': [
    '9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX',
    '9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX', '9EPWZVMNP6D6KWX'
],
'timestamp': [
    1612139269, 1612139665, 1612139579,
    1612141096, 1612143046, 1612143729
],
'type': ['productDetails', 'productDetails', 'checkout:confirmation', 'checkout:confirmation', 'productList', 'checkout:confirmation'],
'session': [0,1,2,3,4,5],
'count_session_products': [4, 1, 0, 4, 2, 2],
'loyalty' : [0,0,0,0,0,0]}

test_df = pd.DataFrame(data)
test_df

which gives me:

user_id timestamp type session prods loyalty
9EPWZVMNP6D6KWX 1612139269 productDetails 0 4 0
9EPWZVMNP6D6KWX 1612139665 productDetails 1 1 0
9EPWZVMNP6D6KWX 1612139579 checkout:confirmation 2 0 0
9EPWZVMNP6D6KWX 1612141096 checkout:confirmation 3 4 0
9EPWZVMNP6D6KWX 1612143046 productList 4 2 0
9EPWZVMNP6D6KWX 1612143729 checkout:confirmation 5 2 0

First condition for frequent_visiotr and first_time_visitor:

test_df['loyalty'] = np.where((test_df['session'] > 0) & ((test_df['type'] != 'checkout:confirmation')), 'frequent_visitor', None)
test_df.loc[test_df['session'] == 0, 'loyalty'] = 'first_time_visitor'

which gives me:

user_id timestamp type session prods loyalty
9EPWZVMNP6D6KWX 1612139269 productDetails 0 4 first_time_visitor
9EPWZVMNP6D6KWX 1612139665 productDetails 1 1 frequent_visitor
9EPWZVMNP6D6KWX 1612139579 checkout:confirmation 2 0 0
9EPWZVMNP6D6KWX 1612141096 checkout:confirmation 3 4 0
9EPWZVMNP6D6KWX 1612143046 productList 4 2 frequent_visitor
9EPWZVMNP6D6KWX 1612143729 checkout:confirmation 5 2 0

Now this is where I am stuck, I need conditions for first_time_customer, repeat_customer and loyal_customer. The rules for these conditions are described above. I would like the end dataframe to look like this:

user_id timestamp type session prods loyalty
9EPWZVMNP6D6KWX 1612139269 productDetails 0 4 first_time_visitor
9EPWZVMNP6D6KWX 1612139665 productDetails 1 1 frequent_visitor
9EPWZVMNP6D6KWX 1612139579 checkout:confirmation 2 0 first_time_customer
9EPWZVMNP6D6KWX 1612141096 checkout:confirmation 3 4 repeat_customer
9EPWZVMNP6D6KWX 1612143046 productList 4 2 frequent_visitor
9EPWZVMNP6D6KWX 1612143729 checkout:confirmation 5 2 loyal_customer

enter image description here

Zoe
  • 27,060
  • 21
  • 118
  • 148
Salaaned
  • 31
  • 3
  • 2
    Pandas makes this kind of thing very simple. Your question would be much better written if you gave Pandas a try and described where you got stuck. I suggest you search for some question using search terms "pandas add column based on condition". For example, here's a good question that offers a nice example of where to start: https://stackoverflow.com/questions/27041724/using-conditional-to-generate-new-column-in-pandas-dataframe – David Parks Nov 28 '21 at 00:43
  • Hey David, thanks for the suggestion. Can you have a look at the question now? Hopefully it will help you understand the problem better now :) – Salaaned Nov 28 '21 at 01:46
  • 1
    This is a well written question now, much more specific, including code and a simplified example. I would add a separate column for each condition (if you want you can add a final column combining all conditions, but only after you have a column for each specific condition). The first time customer column, for example, would be simply `is_new_customer = df['session'] == 0` which returns a `pd.Series` object which you can then just add to the dataframe with `df['is_new_cusotmer'] = is_new_customer`. Now you have a column of True|False values to identify new customers. – David Parks Nov 28 '21 at 16:15
  • This question also covers adding columns to a dataframe: https://stackoverflow.com/questions/12555323/how-to-add-a-new-column-to-an-existing-dataframe – David Parks Nov 28 '21 at 16:15
  • Here's another useful reference question: https://stackoverflow.com/questions/21702342/creating-a-new-column-based-on-if-elif-else-condition – David Parks Nov 28 '21 at 16:24

1 Answers1

2

This is mostly a mapping exercise so I'll perform a couple a let you do the rest.

def determine_loyalty(data: Series) -> str:
  if data["session"] == 0:
    return "first_time_visitor"
  elif data["session"] > 0 and data["count_session_products"] > 0:
    return "frequent_visitor"


df = DataFrame([{session:0,count_session_products:0},{session:2,count_session_products:6}]) # your dataframe
df["loyalty"] = df.apply(lambda x: determine_loyalty(x))

This will give you:

print(df)
   session  count_session_products             loyalty
0        0                       0  first_time_visitor
1        2                       6    frequent_visitor

Now if you only want first time customers you can do:

first_time_customers: DataFrame = df[df["loyalty"] == "first_time_visitor"]

String filtering is bound to be slower than int/bool comparisons so I'd strongly suggest you give the dataframes levels that your application knows the meaning of in the model. For example:

class LoyaltyLevels(Enum):
    FIRST_TIME_VISITOR=1
    FREQUENT_VISITOR=2



# change determine_loyalty to give integer values instead
def determine_loyalty(data: Series) -> str:
  if data["session"] == 0:
    return LoyaltyLevels.FIRST_TIME_VISITOR.value
  elif data["session"] > 0 and data["count_session_products"] > 0:
    return LoyaltyLevels.FREQUENT_VISITOR.value


# now we can filter using the integer value

first_time = df[df["loyalty"] == LoyaltyLevels.FIRST_TIME_VISITOR.value]

bguest
  • 215
  • 1
  • 7
  • I understand this and have the conditions for first_time_visitor and frequent_visitor already created, but the problem is how do I apply the filter for the first_time_customer label? What would the condition be for it? – Salaaned Nov 28 '21 at 00:48
  • @Salaaned answered above – bguest Nov 28 '21 at 00:53