0

I'm trying to find the count of combinations of products inside receipts in Pandas. Thru some help I was able to find the combination of two products in a previous question, but I still have doubts of how to achieve this result, and how I could escalate it to find more combinations.

I have two columns in the data frame, one of receipts and the other of products that were bought:

receipt_id   product_name
1            apple
1            bread
1            cola
2            apple
2            cola
2            bread
3            apple
3            cola
4            apple
4            cola
4            bread

I'd like to find the count of combinations of 3 products inside the receipts. So, for this example, the result should be something like this:

product1     product2     product3  count
apple        bread        cola      3

Which means: this first combination of products appears in 3 receipts. The count of 'apple' and 'cola' does not appear, as it was a combination of just two products.

I know I need to use the groupby function, but now sure of how to organize data inside the same column. Any help is aprecciated! Thanks in advance.

1 Answers1

1

This works although it's quite long.

First I created a new data frame where you have all the unique combinations of your products. I added 'rice' to the equation because you have 4 combinations of 3 products when you have 4 different products.

import pandas as pd
from itertools import combinations

 df = pd.DataFrame({'receipt_id': [1,1,1,2,2,2,3,3,4,4,4],
                'product_name': ['apple', 'bread', 'rice', 'apple', 'rice', 'bread', 'apple', 'cola', 'apple', 'cola', 'bread']})

df
      receipt_id product_name
0            1        apple
1            1        bread
2            1         rice
3            2        apple
4            2         rice
5            2        bread
6            3        apple
7            3         cola
8            4        apple
9            4         cola
10           4        bread

I then obtained the unique products and all the combinations thereof.

unique_products = df.product_name.unique().tolist() #get the unique products
combo = list(combinations(unique_products, 3)) #get a list with all combinations

new_df = pd.DataFrame()

new_df['Product_one'] = [e[0] for e in combo] #only the first element in each tuple
new_df['Product_two'] = [e[1] for e in combo] #the second element
new_df['Product_three'] = [e[2] for e in combo] #the third 

So now we have a data frame with all combinations as rows:

    Product_one Product_two Product_three
0       apple       bread          rice
1       apple       bread          cola
2       apple        rice          cola
3       bread        rice          cola

Then I created a function that accepts three lists as inputs and returns a list that only contains the values that are present in the three lists. I did a for loop over the original data frame and counted the id's that were related to each product.

def get_unique(l1, l2, l3):
#take the 3 lists and return only those elements present in the 3 lists 
    return [e for e in l2 if e in l1 and e in l3]


unique_ids = []
for i in range(len(new_df)):

    list_one = df.receipt_id[(df.product_name == new_df.iloc[i,0])].unique()
    list_two = df.receipt_id[(df.product_name == new_df.iloc[i,1])].unique()
    list_three = df.receipt_id[(df.product_name == new_df.iloc[i,2])].unique()

    unique_ids.append(get_unique(list_one, list_two, list_three))

So now this list contains other lists which contain the id's that satisfy the above conditions. I then added a new column with the length of each element.

new_df['count'] = [len(e) for e in unique_ids]
print(new_df)

    Product_one Product_two Product_three  count
0       apple       bread          rice      2
1       apple       bread          cola      1
2       apple        rice          cola      0
3       bread        rice          cola      0