1

I would like to count the number of pairs of items in a column. I made my solution for it, but I would like to know if there are more concise solutions.

Here is an example and my approach. I have a DataFrame like this.

df = pd.DataFrame({'id':[1,1,2,2], 'item':['apple','orange','orange','apple']})

Finally, I would like to know what items are bought most together. Therefore, in this case, I'd like to get a result that orange and apple are bought most together.

Then, I did groupby based on values in id column.

id_group = df.groupby('id')

Then, to count the number of pairs of items in item column, I made a function like below and applied to item column of id_group. After this, I combined lists of tuples using sum(). Finally, I used Counter() to count the number of pairs containing same items. In combos(), I used sorted()to avoid counting ('apple','orange') and ('orange','apple') separately.

Are there better approaches to get the result showing there are 2 pairs of ('apple','orange') or 2 pairs of ('orange','apple')

import itertools 
from collections import Counter
def combos(x):
     combinations = []
     num = x.size
     while num != 1:
          combinations += list(itertools.combinations(x,num))
          num -= 1
     element_sorted = map(sorted,combinations)
     return list(map(tuple,element_sorted))

k= id_group['item'].apply(lambda x:combos(x)).sum()
Counter(k)

2 Answers2

2

Use all_subsets function with change 0 to 2 for pairs, triples... like your soluton:

#https://stackoverflow.com/a/5898031
from itertools import chain, combinations
def all_subsets(ss):
    return chain(*map(lambda x: combinations(ss, x), range(2, len(ss)+1)))

And then flatten values, I think better is not use sum to concatenate lists. It looks fancy but it's quadratic and should be considered bad practice.

So here is used flattening with sorted tuples in list comprehension:

k = [tuple(sorted(z)) for y in id_group['item'].apply(all_subsets) for z in y]

print (Counter(k))
Counter({('apple', 'orange'): 2})
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
1

How about this?

from collections import Counter
k = df.groupby('id')['item'].apply(lambda x: tuple(x.sort_values()))
Counter(k)

Counter({('apple', 'orange'): 2})
Bertil Johannes Ipsen
  • 1,656
  • 1
  • 14
  • 27