1

I have a dataset with a unique customer_id and several order_id of each time a unique customer made a purchase. It's about the reading glasses, so I removed the product information and now I have just the strongness of the reading glasses (+1, +4, +2.5 etc).

My dataframe for this feature looks like this: screenshot

I've tried multiple things, such as:

testdf = testdf.groupby(['customer_id', 'order_id'])['order_item1', 'order_item2', 'order_item3', 'order_item4', 'order_item5']\
    .agg(list)\
    .apply(lambda x:list(combinations(set(x),2)))\
    .explode()

and:

def top_product(g):

    product_cols = [col for col in g.columns if col.startswith('order_item')]
    try:
        out = (g[product_cols].stack().value_counts(normalize=True)
                             .reset_index().iloc[0])
        out.index = ['most_product']
        return out
    except IndexError:
        return pd.Series({'order_item': 'None', 'most_product' : 0})

output = testdf.groupby('order_id').apply(top_product)

Both don't work. I would like to know of each customer which product they bought most. So for customer_id 11795 it would be 2.5. Any idea on how to do this?

Jan Wilamowski
  • 3,308
  • 2
  • 10
  • 23
Bas
  • 13
  • 3
  • 2
    Please provide the DataFrame in the question so we can copy it (we can't copy from image) – 965311532 Jun 02 '22 at 09:59
  • You're probably looking for the `mode` - see here https://stackoverflow.com/questions/15222754/groupby-pandas-dataframe-and-select-most-common-value – Mortz Jun 02 '22 at 10:08
  • Does this answer your question? [GroupBy pandas DataFrame and select most common value](https://stackoverflow.com/questions/15222754/groupby-pandas-dataframe-and-select-most-common-value) – Mortz Jun 02 '22 at 10:09

1 Answers1

0

I would first use .melt() to reshape the dataframe. Then you could do a .value_counts() on that column. Lastely, sort the dataframe by the count and drop duplicate customer_ids, keeping the first, which will leave you with the highest count for each customer.

import pandas as pd
import numpy as np


data = {
        'order_id':[33163,38596,35326,46139,57446,65838,71228],
        'customer_id':[11795,11795,10613,10613,5729,5729,5729],
        'order_item1':[2.5,2.5,2,-2.5,2.5,2.5,2.5],
        'order_item2':[np.nan,2,2.5,np.nan,2.5,2.5,np.nan],
        'order_item3':[np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan],
        'order_item4':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
        'order_item5':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]}

df = pd.DataFrame(data)

Data

print(df)
   order_id  customer_id  order_item1  ...  order_item3  order_item4  order_item5
0     33163        11795          2.5  ...          NaN          NaN          NaN
1     38596        11795          2.5  ...          NaN          NaN          NaN
2     35326        10613          2.0  ...          2.0          NaN          NaN
3     46139        10613         -2.5  ...          NaN          NaN          NaN
4     57446         5729          2.5  ...          NaN          NaN          NaN
5     65838         5729          2.5  ...          NaN          NaN          NaN
6     71228         5729          2.5  ...          NaN          NaN          NaN

[7 rows x 7 columns]

Melt: Use the 'order_item' columns to melt as the values:

itemCols = [x for x in df.columns if 'order_item' in x]
df_melt = pd.melt(df, id_vars='customer_id', value_vars=itemCols).dropna(subset='value')

Melted output:

print(df_melt)
    customer_id     variable  value
0         11795  order_item1    2.5
1         11795  order_item1    2.5
2         10613  order_item1    2.0
3         10613  order_item1   -2.5
4          5729  order_item1    2.5
5          5729  order_item1    2.5
6          5729  order_item1    2.5
8         11795  order_item2    2.0
9         10613  order_item2    2.5
11         5729  order_item2    2.5
12         5729  order_item2    2.5
16        10613  order_item3    2.0

Value counts on the 'value' column:

value_counts = df_melt.groupby('customer_id')['value'].value_counts().rename('count').reset_index()
value_counts = value_counts.sort_values(['customer_id', 'count'], ascending=[True, False])

Value Counts Output:

print(value_counts)
   customer_id  value  count
0         5729    2.5      5
1        10613    2.0      2
2        10613   -2.5      1
3        10613    2.5      1
4        11795    2.5      2
5        11795    2.0      1

Drop duplicate customer ids, keep the first:

top_sales = value_counts.drop_duplicates(subset='customer_id', keep='first')

Output:

print(top_sales)
   customer_id  value  count
0         5729    2.5      5
1        10613    2.0      2
4        11795    2.5      2

Full code:

import pandas as pd
import numpy as np


data = {
        'order_id':[33163,38596,35326,46139,57446,65838,71228],
        'customer_id':[11795,11795,10613,10613,5729,5729,5729],
        'order_item1':[2.5,2.5,2,-2.5,2.5,2.5,2.5],
        'order_item2':[np.nan,2,2.5,np.nan,2.5,2.5,np.nan],
        'order_item3':[np.nan,np.nan,2,np.nan,np.nan,np.nan,np.nan],
        'order_item4':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan],
        'order_item5':[np.nan,np.nan,np.nan,np.nan,np.nan,np.nan,np.nan]}

df = pd.DataFrame(data)

itemCols = [x for x in df.columns if 'order_item' in x]
df_melt = pd.melt(df, id_vars='customer_id', value_vars=itemCols).dropna(subset='value')

value_counts = df_melt.groupby('customer_id')['value'].value_counts().rename('count').reset_index()
value_counts = value_counts.sort_values(['customer_id', 'count'], ascending=[True, False])

top_sales = value_counts.drop_duplicates(subset='customer_id', keep='first')
chitown88
  • 27,527
  • 4
  • 30
  • 59
  • I got the error: TypeError: Index(...) must be called with a collection of some kind, 'value' was passed. with the code in the comments. It's in the df_melt code line – Bas Jun 02 '22 at 13:57
  • provide me with a sample of your `testdf` (not a screen shot) – chitown88 Jun 02 '22 at 14:00
  • also what verson of `pandas` you running? – chitown88 Jun 02 '22 at 14:00
  • try this for that line `df_melt = pd.melt(df, id_vars='customer_id', value_vars=itemCols).dropna(subset=['value'], axis=0)` – chitown88 Jun 02 '22 at 14:03