11

I am trying to figure out what combination of clothing customers are buying together. I can figure out the exact combination, but the problem I can't figure out is the count that includes the combination + others.

For example, I have:

Cust_num  Item    Rev
Cust1     Shirt1  $40
Cust1     Shirt2  $40
Cust1     Shorts1 $40
Cust2     Shirt1  $40
Cust2     Shorts1 $40

This should result in:

Combo                  Count
Shirt1,Shirt2,Shorts1    1
Shirt1,Shorts1           2

The best I can do is unique combinations:

Combo                 Count
Shirt1,Shirt2,Shorts1   1
Shirt1,Shorts1          1

I tried:

df = df.pivot(index='Cust_num',columns='Item').sum()
df[df.notnull()] = "x"
df = df.loc[:,"Shirt1":].replace("x", pd.Series(df.columns, df.columns))
col = df.stack().groupby(level=0).apply(','.join)
df2 = pd.DataFrame(col)
df2.groupby([0]).size().reset_index(name='counts')

But that is just the unique counts.

lys_dad
  • 121
  • 6

4 Answers4

8

Using pandas.DataFrame.groupby:

grouped_item = df.groupby('Cust_num')['Item']
subsets = grouped_item.apply(lambda x: set(x)).tolist()
Count = [sum(s2.issubset(s1) for s1 in subsets) for s2 in subsets]
combo = grouped_item.apply(lambda x:','.join(x))
combo = combo.reset_index()
combo['Count']=Count

Output:

  Cust_num                   Item  Count
0    Cust1  Shirt1,Shirt2,Shorts1      1
1    Cust2         Shirt1,Shorts1      2
Chris
  • 29,127
  • 3
  • 28
  • 51
  • How is finding subsets finding inclusive combination of `df['Item']`? Subsets are not combinations. Below is output of what you doing: `{' Shirt2', ' Shorts1', ' Shirt1'}: True` followed by `{' Shorts1', ' Shirt1'}: True` and then `{' Shorts1', ' Shirt1'}: True`. Then you sum then to get `[1,2]`. I agree my approach i did is wrong so is yours. I would think `@ResidentSleeper` has correct answer. – Kill3rbee Lee Mtoti Apr 08 '19 at 07:08
  • `@Chris`, I think you need to find combination of `Item` first which would give you your `subsets`. You would sum unique. FYI, I did not mock your comment. I was asking a question just like you asked me a question. I did not throw a hissy or downvote you. Thanks for sharing – Kill3rbee Lee Mtoti Apr 08 '19 at 07:45
  • @LeeMtoti Apologies for a strong language. I've deleted it. BTW, `grouped_item` contains a combination of each `CustN`'s choice of clothing. `lambda x: set(x)` was implemented for a `issubset` comparison. As you pointed out, `issubset` returns `True` if and only if a set is contained in other set, which I still believe is what OP wants. – Chris Apr 08 '19 at 07:48
  • I believe the term _combination_ is subtle misleading. My understanding is a set of each customer's choices of Item. Hope this makes my answer and intention clearer. – Chris Apr 08 '19 at 07:50
  • Apologies if the language wasn't correct. I wasn't sure how to describe it. I tried this with the first 1,000 customers, and it worked! I'm running it now with the full list. – lys_dad Apr 08 '19 at 15:42
2

I think you need to create a combination of items first.

How to get all possible combinations of a list’s elements?

I used the function from Dan H's answer.

from itertools import chain, combinations
def all_subsets(ss):
    return chain(*map(lambda x: combinations(ss, x), range(0, len(ss)+1)))

uq_items = df.Item.unique()

list(all_subsets(uq_items))

[(),
 ('Shirt1',),
 ('Shirt2',),
 ('Shorts1',),
 ('Shirt1', 'Shirt2'),
 ('Shirt1', 'Shorts1'),
 ('Shirt2', 'Shorts1'),
 ('Shirt1', 'Shirt2', 'Shorts1')]

And use groupby each customer to get their items combination.

ls = []

for _, d in df.groupby('Cust_num', group_keys=False):
    # Get all possible subset of items
    pi = np.array(list(all_subsets(d.Item)))

    # Fliter only > 1
    ls.append(pi[[len(l) > 1 for l in pi]])

Then convert to Series and use value_counts().

pd.Series(np.concatenate(ls)).value_counts()

(Shirt1, Shorts1)            2
(Shirt2, Shorts1)            1
(Shirt1, Shirt2, Shorts1)    1
(Shirt1, Shirt2)             1
ResidentSleeper
  • 2,385
  • 2
  • 10
  • 20
2

Late answer, but you can use:

df = df.groupby(['Cust_num'], as_index=False).agg(','.join).drop(columns=['Rev']).set_index(['Item']).rename_axis("combo").rename(columns={"Cust_num": "Count"})
df['Count'] = df['Count'].str.replace(r'Cust','')

combo                   Count                 
Shirt1,Shirt2,Shorts1     1
Shirt1,Shorts1            2
Pedro Lobito
  • 94,083
  • 31
  • 258
  • 268
0

My version which I believe is easier to understand

new_df = df.groupby("Cust_num").agg({lambda x: ''.join(x.unique())})

new_df ['count'] = range(1, len(new_df ) + 1)

Output:

                            Item      Rev count
                        <lambda> <lambda>      
Cust_num                                       
Cust1      Shirt1 Shirt2 Shorts1      $40     1
Cust2             Shirt1 Shorts1      $40     2

Since you do not need the Rev column, you can drop it:

new_df = new_df = new_df.drop(columns=["Rev"]).reset_index()

new_df

Output:

  Cust_num                    Item count
                          <lambda>      
0    Cust1   Shirt1 Shirt2 Shorts1     1
1    Cust2          Shirt1 Shorts1     2

This edit is to respond to @Chris by looking at his approach written using list comprehension. He created an list of sets:

[{' Shirt1', ' Shirt2', ' Shorts1'}, {' Shirt1', ' Shorts1'}]

Then next step finds the subsets:

for s1 in subsets:
    for s2 in subsets:
        if s2.issubset(s1):
            print("{}: {}".format(s2,s2.issubset(s1)))

Output:

{' Shirt2', ' Shorts1', ' Shirt1'}: True
{' Shorts1', ' Shirt1'}: True
{' Shorts1', ' Shirt1'}: True

You asked me to explain myself and I did. However after thinking about it, I realized your approach was wrong too. As such I was not mocking you, but thanking you for making me think about my solution. Also thanks to @ResidentSleeper for his solution.

Kill3rbee Lee Mtoti
  • 246
  • 1
  • 2
  • 11