Python product frequently bought with

Question

I have retail store transactional data and want to see what categories are bought together. The data is in the below format:

transaction_no	product_id	category
1	100012	A
1	121111	A
1	121127	B
1	121127	G
2	465222	N
2	121127	M
3	121127	F
3	121127	G
3	121127	F
4	465222	M
4	121127	N

Rules:

The result should be aggregated based on unique transaction numbers.
Also, the order shouldn't matter in this case (e.g. A bought with B is the same as B bought with A).
If a category is repeated within the same transaction, it should be counted as 1 only (e.g. in transaction_no = 1, category A is counted once)

Expected output:

bucket	count
A, B, G	1
N, M	2
F, G	1

How do I achieve this?

jezrael · Accepted Answer · 2022-11-28T11:56:15.917

Use GroupBy.agg for aggregate frozenset, then count values by Series.value_counts and last create DataFrame with join for strings from frozensets:

df1 = (df.groupby('transaction_no')['category']
        .agg(frozenset)
        .value_counts()
        .rename(lambda x: ', '.join(sorted(x)))
        .rename_axis('bucket')
        .reset_index(name='count'))
print (df1)
    bucket  count
0     M, N      2
1     F, G      1
2  A, B, G      1

Another idea:

df1 = (df.groupby('transaction_no')['category']
        .agg(lambda x: ', '.join(sorted(set((x)))))
        .value_counts()
        .rename_axis('bucket')
        .reset_index(name='count')
        )
print (df1)
    bucket  count
0     M, N      2
1     F, G      1
2  A, B, G      1

Python product frequently bought with

1 Answers1