I have a large DataFrame of receipts and items. There are 48,000 unique items and 3.7 million receipts. For every pair of items, I need to count how many receipts contain both. My ugly code, given below, will by my preliminary estimate run until the heat death of the universe. I'm sure there is some pandas magic that makes this task much easier, but I can't find anything.
uniq_itm = train['item_name'].unique()
i = 0
for itm_x in uniq_itm:
    i = i + 1
    # Stop after the first half of the items.
    if i > len(uniq_itm) / 2:
        break
    for itm_y in uniq_itm:
        percent_complete = round((i / (len(uniq_itm) / 2)) * 100, 2)
        if itm_x != itm_y:
            # Receipts containing each item; both queries rescan the whole frame.
            receipts_x = set(train.query('item_name==@itm_x')['receipt_id'].unique())
            receipts_y = set(train.query('item_name==@itm_y')['receipt_id'].unique())
            k = len(receipts_x & receipts_y)
            if k > 0:
                print(itm_x + ' ' + itm_y + ' ' + str(k) + ' ' + str(percent_complete) + '%')
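
One direction that might avoid the nested loop over all item pairs, sketched here rather than tested at full scale: group the rows by receipt and count only the pairs that actually occur together on a receipt. The helper name `count_item_pairs` and the tiny sample frame are made up for illustration; the column names `receipt_id` and `item_name` are the ones from the code above. It assumes individual receipts are short enough that enumerating their pairs is cheap.

```python
from collections import Counter
from itertools import combinations

import pandas as pd


def count_item_pairs(df, receipt_col="receipt_id", item_col="item_name"):
    """Count, for each unordered item pair, the receipts containing both.

    Hypothetical helper, not part of pandas.
    """
    # Drop repeated (receipt, item) rows so a receipt counts each pair once.
    dedup = df[[receipt_col, item_col]].drop_duplicates()
    pair_counts = Counter()
    for _, items in dedup.groupby(receipt_col)[item_col]:
        # sorted() gives every pair one canonical order, so (a, b) and
        # (b, a) are never counted as different pairs.
        pair_counts.update(combinations(sorted(items), 2))
    return pair_counts


# Tiny made-up dataset, for illustration only.
train = pd.DataFrame({
    "receipt_id": [1, 1, 1, 2, 2, 3],
    "item_name": ["milk", "bread", "eggs", "milk", "bread", "eggs"],
})

pair_counts = count_item_pairs(train)
for (itm_x, itm_y), k in pair_counts.most_common():
    print(itm_x, itm_y, k)
```

The work here scales with the sum over receipts of (items per receipt) squared, instead of with the square of the number of unique items, and pairs that never co-occur are simply absent from the counter. Whether that is fast enough for 3.7 million receipts depends on how long the receipts are.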