Python - Finding items that are bought together the most

Question

I have a pandas dataframe of about 40k entries in the following format:

invoiceNo | item

import pandas as pd
df = pd.DataFrame({'invoiceNo': ['123', '123', '124', '124'], 
                   'item': ['plant', 'grass', 'hammer', 'screwdriver']})

Let's say a customer can buy several items under one single invoice number.

Is there a way for me to check what items get bought together the most?

The first thing I tried was to get all unique IDs to loop through

unique_invoice_id = df.invoiceNo.unique().tolist()

Thanks!

Please read [how to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). — cmaher, Jan 31 '18 at 17:34
Please provide with some data example or approach you have tried? — Bhushan Pant, Jan 31 '18 at 17:35

score 0 · Accepted Answer · answered Jan 31 '18 at 17:50

0

Without loss of generality, I'm going to use lists instead of a dataframe. You can easily extract the lists required from the dataframe if necessary.

from itertools import combinations
from collections import defaultdict

x = [1, 1, 1, 2, 2, 2, 3, 3, 3]  # invoice number
y = ['a', 'b', 'c', 'a', 'c', 'e', 'a', 'c', 'd']  # item

z = defaultdict(set)
for i, j in zip(x, y):
    z[i].add(j)

print(z)

d = defaultdict(int)
for i in range(2, len(set(y))):
    combs = combinations(set(y), i)
    for comb in combs:
        for k, v in z.items():
            if set(comb).issubset(set(v)):
                d[tuple(comb)] += 1

list(reversed(sorted([[v, k] for k, v in d.items()])))

# [[3, ('c', 'a')],
#  [1, ('d', 'c', 'a')],
#  [1, ('d', 'c')],
#  [1, ('d', 'a')],
#  [1, ('c', 'e')],
#  [1, ('c', 'a', 'e')],
#  [1, ('b', 'c', 'a')],
#  [1, ('b', 'c')],
#  [1, ('b', 'a')],
#  [1, ('a', 'e')]]

Interpretation is 'c' and 'a' were bought together 3 times, etc.

answered Jan 31 '18 at 17:50

jpp

159,742
34
281
339

Hey, thank you! This is more or less what I am looking for! Is there a way to only output the top 15 combinations? – Macterror Jan 31 '18 at 18:00
just filter as so: `list(reversed(sorted([[v, k] for k, v in d.items()])))[:15]`. Of course, if there are many combinations with "1" it will only include a fraction to fill up to 15. If this answer helped, feel free to accept. – jpp Jan 31 '18 at 18:01
Awesome! It is taking a little while to run on the full dataset so I am just waiting on the result! Thank you so much – Macterror Jan 31 '18 at 18:03
One last one question, is there way to restrict it to pairs of items? I think trying to find combinations over a pair would make this run for a very long time – Macterror Jan 31 '18 at 18:13
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/164291/discussion-between-macterror-and-jp-data-analysis). – Macterror Jan 31 '18 at 18:15
actually limiting to specific combinations might require some more work – jpp Jan 31 '18 at 18:16

Python - Finding items that are bought together the most

1 Answers1