I have retail store transactional data and want to see what categories are bought together. The data is in the below format:
transaction_no | product_id | category |
---|---|---|
1 | 100012 | A |
1 | 121111 | A |
1 | 121127 | B |
1 | 121127 | G |
2 | 465222 | N |
2 | 121127 | M |
3 | 121127 | F |
3 | 121127 | G |
3 | 121127 | F |
4 | 465222 | M |
4 | 121127 | N |
Rules:
- The result should be aggregated based on unique transaction numbers.
- Also, the order shouldn't matter in this case (e.g. A bought with B is the same as B bought with A).
- If a category is repeated within the same transaction, it should be counted as 1 only (e.g. in transaction_no = 1, category A is counted once)
Expected output:
bucket | count |
---|---|
A, B, G | 1 |
N, M | 2 |
F, G | 1 |
How do I achieve this?