Given a list of purchase events (customer_id, item):
1-hammer
1-screwdriver
1-nails
2-hammer
2-nails
3-screws
3-screwdriver
4-nails
4-screws
I'm trying to build a data structure that tells how many customers who bought one item also bought another. The items need not have been bought at the same time, just by the same customer at any point since I started saving data. The result would look like:
{
hammer : {screwdriver : 1, nails : 2},
screwdriver : {hammer : 1, screws : 1, nails : 1},
screws : {screwdriver : 1, nails : 1},
nails : {hammer : 2, screws : 1, screwdriver : 1}
}
indicating that a hammer was bought with nails twice (persons 1, 2) and with a screwdriver once (person 1), screws were bought with a screwdriver once (person 3), and so on...
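To make the target concrete, here is a minimal brute-force sketch that counts co-purchases straight from the sample events. It is only a reference for checking results, not the approach described below. The names `co_purchase_counts` and `baskets` are made up for illustration, it assumes Python 3, and it assumes a customer who buys the same item more than once should be counted only once.

```python
from collections import defaultdict
from itertools import permutations

# Sample events from above as (customer_id, item) pairs.
events = [
    (1, "hammer"), (1, "screwdriver"), (1, "nails"),
    (2, "hammer"), (2, "nails"),
    (3, "screws"), (3, "screwdriver"),
    (4, "nails"), (4, "screws"),
]

def co_purchase_counts(events):
    """For each item, count how many customers also bought each other item."""
    # Group items per customer; a set ignores repeat purchases (an assumption).
    baskets = defaultdict(set)
    for customer, item in events:
        baskets[customer].add(item)

    counts = defaultdict(lambda: defaultdict(int))
    for items in baskets.values():
        # permutations(items, 2) yields every ordered pair of distinct items,
        # so both counts[a][b] and counts[b][a] are incremented.
        for a, b in permutations(items, 2):
            counts[a][b] += 1
    # Convert to plain dicts for easy printing and comparison.
    return {item: dict(others) for item, others in counts.items()}

counts = co_purchase_counts(events)
print(counts["hammer"])
```

The work per customer grows with the square of the number of distinct items that customer bought, which is fine for small baskets but worth keeping in mind for heavy buyers.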
My current approach is:
users = dict where userid is the key and a list of items bought is the value
usersForItem = dict where itemid is the key and a list of users who bought the item is the value
userlist = temporary list of users who have bought the current item
Pseudocode:
for each event (customer, item), sorted by item:
    add user to users dict if not exists, and add the item
    add item to usersForItem dict if not exists, and add the user
----------
# rows holds (item, user) pairs sorted by item
users = {}
usersForItem = {}
items = []
userlist = []
last_item = None
for item, user in rows:
    # add the user to the users dict if they don't already exist
    users[user] = users.get(user, [])
    # append the current item to the list of items bought by the current user
    users[user].append(item)
    if item != last_item:
        # we just started a new item, which means we just finished processing an item;
        # write the userlist for the last item to the usersForItem dictionary
        if last_item is not None:
            usersForItem[last_item] = userlist
        userlist = [user]
        last_item = item
        items.append(item)
    else:
        userlist.append(user)
# flush the userlist for the final item (skipped when rows is empty)
if last_item is not None:
    usersForItem[last_item] = userlist
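For comparison, the same two lookups can likely be built in a single pass that does not depend on the rows being sorted by item. This is a minimal sketch rather than the code above: `build_indexes` is a hypothetical helper name, and it assumes sets are acceptable in place of lists, so repeat purchases of the same item by the same user collapse into one entry.

```python
from collections import defaultdict

def build_indexes(rows):
    """Build both lookups in one pass; rows are (item, user) pairs in any order."""
    users = defaultdict(set)         # user -> items that user bought
    usersForItem = defaultdict(set)  # item -> users who bought that item
    for item, user in rows:
        users[user].add(item)
        usersForItem[item].add(user)
    return users, usersForItem

# The sample events from the top of the post, as (item, user) pairs.
rows = [
    ("hammer", 1), ("screwdriver", 1), ("nails", 1),
    ("hammer", 2), ("nails", 2),
    ("screws", 3), ("screwdriver", 3),
    ("nails", 4), ("screws", 4),
]
users, usersForItem = build_indexes(rows)
```

Using sets also makes membership tests and intersections cheap later on, at the cost of losing purchase order and repeat counts.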
So, at this point, I have 2 dicts - who bought what, and what was bought by whom. Here's where it gets tricky. Now that usersForItem is populated, I loop through it, loop through each user who bought the item, and look at that user's other purchases. I acknowledge that this is not the most pythonic way of doing things - I'm trying to make sure I get the correct result (which I am) before getting fancy with the Python.
relatedItems = {}
for key, listOfUsers in usersForItem.items():
    relatedItems[key] = {}
    related = []
    for ux in listOfUsers:
        for itemRead in users[ux]:
            if itemRead != key:
                if itemRead not in related:
                    related.append(itemRead)
                # count one more user who bought both key and itemRead
                relatedItems[key][itemRead] = relatedItems[key].get(itemRead, 0) + 1
    # calc jaccard/tanimoto similarity between relatedItems[key] and its values
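The similarity step is only described in words above, so here is a minimal sketch of one common reading of it. It assumes the Jaccard (Tanimoto) similarity between two items is taken over the sets of users who bought each item, which may not be exactly the calculation intended above. The function name `jaccard` is made up for illustration, and Python 3 is assumed so that `/` is true division.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        # Two empty sets: treat the similarity as 0.0 (a convention, not a rule).
        return 0.0
    return len(a & b) / len(union)

# Users per item for the sample data at the top of the post.
usersForItem = {
    "hammer": {1, 2},
    "screwdriver": {1, 3},
    "screws": {3, 4},
    "nails": {1, 2, 4},
}

# hammer and nails share users 1 and 2 out of the three users {1, 2, 4}.
print(jaccard(usersForItem["hammer"], usersForItem["nails"]))
```

The intersection size here is the same number as the co-purchase count for the pair, so the counts in relatedItems can supply the numerator directly.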
Is there a more efficient way that I can be doing this? Additionally, if there is a proper academic name for this type of operation, I'd love to hear it.
edit: clarified to include the fact that I'm not restricting purchases to items bought together at the same time. Items can be bought at any time.