I would first use .melt()
to reshape the dataframe, then take a .value_counts()
per customer on the melted 'value' column. Lastly, sort by the count and drop duplicate customer_ids, keeping the first, which leaves the highest count for each customer.
import pandas as pd
import numpy as np
data = {
'order_id':[33163,38596,35326,46139,57446,65838,71228],
'customer_id':[11795,11795,10613,10613,5729,5729,5729],
'order_item1':[2.5,2.5,2,-2.5,2.5,2.5,2.5],
'order_item2':[np.nan,2,2.5,np.nan,2.5,2.5,np.nan],
'order_item3':[np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan],
'order_item4':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'order_item5':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]}
df = pd.DataFrame(data)
Data
print(df)
order_id customer_id order_item1 ... order_item3 order_item4 order_item5
0 33163 11795 2.5 ... NaN NaN NaN
1 38596 11795 2.5 ... NaN NaN NaN
2 35326 10613 2.0 ... 2.0 NaN NaN
3 46139 10613 -2.5 ... NaN NaN NaN
4 57446 5729 2.5 ... NaN NaN NaN
5 65838 5729 2.5 ... NaN NaN NaN
6 71228 5729 2.5 ... NaN NaN NaN
[7 rows x 7 columns]
Melt: use the 'order_item'
columns as the value columns:
itemCols = [x for x in df.columns if 'order_item' in x]
df_melt = pd.melt(df, id_vars='customer_id', value_vars=itemCols).dropna(subset='value')
Melted output:
print(df_melt)
customer_id variable value
0 11795 order_item1 2.5
1 11795 order_item1 2.5
2 10613 order_item1 2.0
3 10613 order_item1 -2.5
4 5729 order_item1 2.5
5 5729 order_item1 2.5
6 5729 order_item1 2.5
8 11795 order_item2 2.0
9 10613 order_item2 2.5
11 5729 order_item2 2.5
12 5729 order_item2 2.5
16 10613 order_item3 2.0
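As an aside, the itemCols list built above with a comprehension can also be written with DataFrame.filter, which keeps the columns whose name contains a substring. A minimal sketch on a toy frame:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'order_id': [33163, 38596],
    'customer_id': [11795, 11795],
    'order_item1': [2.5, 2.5],
    'order_item2': [np.nan, 2.0],
})

# filter(like=...) keeps the columns whose name contains the substring
itemCols = df.filter(like='order_item').columns.tolist()
print(itemCols)  # ['order_item1', 'order_item2']
```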
Value counts per customer on the 'value'
column:
value_counts = df_melt.groupby('customer_id')['value'].value_counts().rename('count').reset_index()
value_counts = value_counts.sort_values(['customer_id', 'count'], ascending=[True, False])
Value Counts Output:
print(value_counts)
customer_id value count
0 5729 2.5 5
1 10613 2.0 2
2 10613 -2.5 1
3 10613 2.5 1
4 11795 2.5 2
5 11795 2.0 1
Drop duplicate customer ids, keep the first:
top_sales = value_counts.drop_duplicates(subset='customer_id', keep='first')
Output:
print(top_sales)
customer_id value count
0 5729 2.5 5
1 10613 2.0 2
4 11795 2.5 2
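An equivalent way to do this last step, if you prefer not to rely on sort order, is groupby + idxmax, which picks the row holding the max count per customer. A sketch, with the value_counts frame hard-coded so it runs standalone:

```python
import pandas as pd

# value_counts as produced above (hard-coded here so the snippet runs standalone)
value_counts = pd.DataFrame({
    'customer_id': [5729, 10613, 10613, 10613, 11795, 11795],
    'value':       [2.5,  2.0,   -2.5,  2.5,   2.5,   2.0],
    'count':       [5,    2,     1,     1,     2,     1],
})

# idxmax returns the index of the first max per group, so ties resolve
# to the first occurrence, same as drop_duplicates(keep='first')
top_sales = value_counts.loc[value_counts.groupby('customer_id')['count'].idxmax()]
print(top_sales)
```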
Full code:
import pandas as pd
import numpy as np
data = {
'order_id':[33163,38596,35326,46139,57446,65838,71228],
'customer_id':[11795,11795,10613,10613,5729,5729,5729],
'order_item1':[2.5,2.5,2,-2.5,2.5,2.5,2.5],
'order_item2':[np.nan,2,2.5,np.nan,2.5,2.5,np.nan],
'order_item3':[np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan],
'order_item4':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
'order_item5':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]}
df = pd.DataFrame(data)
itemCols = [x for x in df.columns if 'order_item' in x]
df_melt = pd.melt(df, id_vars='customer_id', value_vars=itemCols).dropna(subset='value')
value_counts = df_melt.groupby('customer_id')['value'].value_counts().rename('count').reset_index()
value_counts = value_counts.sort_values(['customer_id', 'count'], ascending=[True, False])
top_sales = value_counts.drop_duplicates(subset='customer_id', keep='first')
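For what it's worth, the same result can be produced as one chain with stack() instead of melt(), relying on SeriesGroupBy.value_counts sorting counts descending within each group. A sketch of that alternative:

```python
import pandas as pd
import numpy as np

data = {
    'order_id': [33163, 38596, 35326, 46139, 57446, 65838, 71228],
    'customer_id': [11795, 11795, 10613, 10613, 5729, 5729, 5729],
    'order_item1': [2.5, 2.5, 2, -2.5, 2.5, 2.5, 2.5],
    'order_item2': [np.nan, 2, 2.5, np.nan, 2.5, 2.5, np.nan],
    'order_item3': [np.nan, np.nan, 2, np.nan, np.nan, np.nan, np.nan],
    'order_item4': [np.nan] * 7,
    'order_item5': [np.nan] * 7,
}
df = pd.DataFrame(data)

top_sales = (
    df.set_index('customer_id')
      .filter(like='order_item')       # keep only the item columns
      .stack()                         # long format: (customer_id, item) -> value
      .dropna()                        # newer pandas stack() may keep NaNs
      .groupby(level='customer_id')
      .value_counts()                  # sorted descending within each customer
      .groupby(level='customer_id')
      .head(1)                         # highest count per customer
      .rename_axis(['customer_id', 'value'])
      .rename('count')
      .reset_index()
)
print(top_sales)
```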