
I have a large dataframe with receipts and items: 48,000 unique items across 3.7 million receipts. For every pair of items I need to count how often both appear in the same receipt. My awful code below would, by my preliminary estimates, run until the death of the universe. I'm sure there is some pandas magic that makes this task much easier, but I can't find anything.

uniq_itm = train['item_name'].unique()

# Stop halfway through the outer loop to roughly halve the work
for i, itm_x in enumerate(uniq_itm, start=1):
    if i > len(uniq_itm) / 2:
        break
    percent_complete = round(i / (len(uniq_itm) / 2) * 100, 2)
    for itm_y in uniq_itm:
        if itm_x != itm_y:
            # Receipts containing both items, via set intersection
            k = len(set(train.query('item_name == @itm_x')['receipt_id'].unique())
                    & set(train.query('item_name == @itm_y')['receipt_id'].unique()))
            if k > 0:
                print(itm_x + ' ' + itm_y + ' ' + str(k) + ' ' + str(percent_complete) + '%')

Dataframe example (sorry for the Cyrillic): [screenshot of the dataframe]

Ilya

2 Answers


You probably want to use a frequent pattern mining algorithm. FP-Growth is efficient here because it lets you cap the size of the itemsets ahead of time (2, in this case, for pairs) and lets you set the minimum support (the fraction of transactions) that an itemset must reach.

Consider the following:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import fpgrowth

# Sample data in a similar structure to yours
df = pd.DataFrame({
    'receipt_id': [1, 1, 2, 2, 3, 3],
    'receipt_dayofweek': [4, 4, 5, 5, 6, 6],
    'receipt_time': ['20:20', '20:20', '12:13', '12:13', '11:10', '11:10'],
    'item_name': ['Milk', 'Onion', 'Dill', 'Onion', 'Milk', 'Onion']
})

# Build the list of items per transaction (one receipt = one transaction)
dataset = df.groupby(['receipt_id', 'receipt_dayofweek', 'receipt_time'])['item_name'].apply(list).values

# Create the required structure for data to go into the algorithm
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset)
df = pd.DataFrame(te_ary, columns=te.columns_)

# Generate frequent itemsets with a support of 1/len(dataset)
# This is the same as saying: give me every combination that shows up at least once
# The maximum size of any given itemset is 2, but you could change it to any number
frequent = fpgrowth(df, min_support=1/len(dataset), use_colnames=True, max_len=2)

# Get rid of single-item records
frequent = frequent[frequent['itemsets'].apply(len) == 2]

# Multiply support by the number of transactions to get the count of times each itemset appeared
# in the original data set
frequent['frequency'] = frequent['support'] * len(dataset)

# View the results
print(frequent[['itemsets','frequency']])

Output

        itemsets  frequency
3  (Milk, Onion)        2.0
4  (Dill, Onion)        1.0
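
If the dense boolean matrix does not fit in memory on the full data set (see the comments below about a 166 GiB allocation), a sparse encoding may help. This is only a sketch, assuming a recent mlxtend and pandas where TransactionEncoder.transform accepts sparse=True and fpgrowth accepts a dataframe with sparse columns:

# Hedged sketch: sparse boolean encoding instead of a dense array
te = TransactionEncoder()
te_ary = te.fit(dataset).transform(dataset, sparse=True)  # scipy CSR matrix
df_sparse = pd.DataFrame.sparse.from_spmatrix(te_ary, columns=te.columns_)

# The rest of the pipeline is unchanged
frequent = fpgrowth(df_sparse, min_support=1/len(dataset), use_colnames=True, max_len=2)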
Chris
  • Thank you! I will learn your example more, but after the first attempt, I got: MemoryError: Unable to allocate 166. GiB for an array with shape (3704200, 48225) and data type bool – Ilya Mar 07 '21 at 18:21
  • Yeah you'll need a virtual machine with lots of memory and plenty of time, it's a very large data set. – Chris Mar 08 '21 at 00:50
  • I rented a powerful server for a few hours and your approach produced brilliant results. Thank you very much. – Ilya Mar 08 '21 at 09:45

I'm not sure how it will behave performance- and memory-wise, but I would probably tackle the problem by converting the item_name column from string to categorical, then building a pivot table from the original dataframe so there is one row per unique receipt:

# Make item_name categorical and encode it for simplicity
df["item_name"] = df["item_name"].astype('category')
df["item_name_cat"] = df["item_name"].cat.codes

# Pivot table
reshaped_df = df.pivot_table(index="receipt_id", columns="item_name_cat", values="item_quantity").reset_index()

From that I would generate all the unique item pairs with itertools and use them to count the rows that match the required definition (both products in the receipt):

from itertools import combinations

# Get all the unique items and compute combinations
unique_names = df.item_name_cat.unique()
unique_couples = combinations(unique_names, 2)

# Get the number of unique receipts
number_of_rows = len(reshaped_df)

# For each pair, count the rows (receipts) that contain both items
for combination in unique_couples:
    num_of_results = len(reshaped_df[(reshaped_df[combination[0]] > 0) & (reshaped_df[combination[1]] > 0)])
    percentage = num_of_results / number_of_rows * 100
    print("{}% of receipts contain items {} and {}".format(percentage, combination[0], combination[1]))

I'm printing the output here, but you should be able to export the results in a more manageable way.
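
For instance, a minimal sketch that collects the counts into a dataframe instead of printing them (the CSV file name is just a placeholder):

# Sketch: accumulate the pair counts instead of printing them
records = []
for a, b in combinations(unique_names, 2):
    n = int(((reshaped_df[a] > 0) & (reshaped_df[b] > 0)).sum())
    if n > 0:
        records.append({"item_x": a, "item_y": b, "count": n})

pairs_df = pd.DataFrame(records)
pairs_df.to_csv("pair_counts.csv", index=False)  # placeholder file name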

Kastakin
  • I got 'ValueError: Unstacked DataFrame is too big, causing int32 overflow' after reshaped_df = .... – Ilya Mar 07 '21 at 18:26
  • You could probably try to do this operation in chunks: https://stackoverflow.com/a/64814616/11739938 – Kastakin Mar 07 '21 at 18:29
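
A different route, sketched below, avoids both the dense pivot and the pair loop entirely by multiplying a scipy sparse receipt-by-item matrix with its transpose. This is only a sketch of an alternative technique (not the chunking from the linked answer), assuming the original dataframe df has receipt_id and item_name columns:

import numpy as np
import pandas as pd
from scipy import sparse

# Encode receipts and items as integer codes
items = df["item_name"].astype("category")
receipts = df["receipt_id"].astype("category")

# Sparse receipt-by-item matrix; duplicate entries are summed on construction
m = sparse.csr_matrix(
    (np.ones(len(df), dtype=np.int32), (receipts.cat.codes, items.cat.codes))
)
m.data[:] = 1  # keep presence only, ignore quantities

# (m.T @ m)[i, j] = number of receipts containing both item i and item j;
# keep the upper triangle so each unordered pair is listed once
cooc = sparse.triu(m.T @ m, k=1).tocoo()

pairs = pd.DataFrame({
    "item_x": items.cat.categories[cooc.row],
    "item_y": items.cat.categories[cooc.col],
    "count": cooc.data,
})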