
given a list of purchase events (customer_id,item)

1-hammer
1-screwdriver
1-nails
2-hammer
2-nails
3-screws
3-screwdriver
4-nails
4-screws

I'm trying to build a data structure that tells how many times an item was bought with another item. Not necessarily bought at the same time, but bought at any point since I started saving data. The result would look like:

{
       hammer : {screwdriver : 1, nails : 2}, 
  screwdriver : {hammer : 1, screws : 1, nails : 1}, 
       screws : {screwdriver : 1, nails : 1}, 
        nails : {hammer : 2, screws : 1, screwdriver : 1}
}

indicating that a hammer was bought with nails twice (persons 1 and 2) and with a screwdriver once (person 1), screws were bought with a screwdriver once (person 3), and so on...

my current approach is

users = dict where userid is the key and a list of items bought is the value

usersForItem = dict where itemid is the key and list of users who bought item is the value

userlist = temporary list of users who have bought the current item

pseudo:
for each event (customer, item), sorted by item:
  add the user to the users dict if not present, and append the item
  add the item to the items dict if not present, and append the user
----------

# rows: the purchase events as (item, user) pairs, sorted by item
users = {}
usersForItem = {}
items = []
userlist = []
last_item = None

for item,user in rows:

  # add the user to the users dict if they don't already exist
  users[user] = users.get(user, [])

  # append the current item to the list of items bought by the current user
  users[user].append(item)

  if item != last_item:
    # we just started a new item, which means we just finished processing
    # the previous one: write its userlist to the usersForItem dictionary
    if last_item is not None:
      usersForItem[last_item] = userlist

    userlist = [user]

    last_item = item
    items.append(item)
  else:
    userlist.append(user)

usersForItem[last_item] = userlist

So, at this point, I have 2 dicts - who bought what, and what was bought by whom. Here's where it gets tricky. Now that usersForItem is populated, I loop through it, loop through each user who bought the item, and look at that user's other purchases. I acknowledge that this is not the most pythonic way of doing things - I'm trying to make sure I get the correct result (which I am) before getting fancy with the Python.

relatedItems = {}
for key,listOfUsers in usersForItem.iteritems():
  relatedItems[key]={}
  related=[]

  for ux in listOfUsers:
    for itemRead in users[ux]:
      if itemRead != key:
        if itemRead not in related:
          related.append(itemRead)
        relatedItems[key][itemRead]= relatedItems[key].get(itemRead,0) + 1    

  # calc Jaccard/Tanimoto similarity between relatedItems[key] and its values
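
For that last step, one common way to compute Jaccard/Tanimoto similarity between two items is over the sets of users who bought each; here's a minimal sketch (the jaccard helper is hypothetical, not part of the code above):

def jaccard(users_a, users_b):
    # |A & B| / |A | B| over the sets of users who bought each item
    users_a, users_b = set(users_a), set(users_b)
    union = users_a | users_b
    return float(len(users_a & users_b)) / len(union) if union else 0.0

With the sample data, usersForItem['hammer'] is [1, 2] and usersForItem['nails'] is [1, 2, 4], so jaccard between them is 2/3.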

Is there a more efficient way that I can be doing this? Additionally, if there is a proper academic name for this type of operation, I'd love to hear it.

edit: clarified to include the fact that I'm not restricting purchases to items bought together at the same time. Items can be bought at any time.

Neil Kodner
  • How big is your dataset? Amazon uses some algos on hadoop to look through every item it has ever sold and generate the suggested items... Also, please see http://stackoverflow.com/questions/1849693/simple-suggestion-recommendation-algorithm – bwawok Jun 24 '10 at 12:15
  • Let's say 2 million events with about 5000 users and 10000 items. Wouldn't k-nearest neighbors involve comparing every item against every other item, even ones with no overlap? – Neil Kodner Jun 24 '10 at 12:33
  • If you're only interested in how many times items were bought together, why are you worried about users? You could group simply by purchase events, so that you'd have a dictionary of products keyed to other products bought with them. In fact, you could key each product to a max-heap of other products with values of how many times they're bought together, such that you'd be able to efficiently get the most commonly purchased item(s) for any given item. (The heap bit might go too far). If this is a workable idea for you I can write up a more complete answer. – nearlymonolith Jun 24 '10 at 13:44
  • Anthony, this is exactly the feedback I'm looking for. Please elaborate. I don't care about userids; I just want to count the number of times items a and b were purchased by the same user. – Neil Kodner Jun 24 '10 at 14:01
  • Are you looking for the number of times purchased by the same user? Or the number of times they occur for the same user at all? I.e., is it important that I bought hammers and nails at the same time, or that I bought a hammer, then came back and bought nails at some other point? – nearlymonolith Jun 24 '10 at 14:16
  • Anthony, I'm not looking for items bought at the same time. I'm just looking for overlap since the beginning of time. Incidentally, I'll be doing a select distinct on item,user and I'll be disregarding whether a user bought more than one of an item. – Neil Kodner Jun 24 '10 at 14:58
  • I gave it a shot - not sure of the quickness/optimization of my algorithm. I'd personally suggest looking at something like MongoDB, a NoSQL database, since it seems it may lend itself nicely to solving this kind of problem (what with map/reduce and all) – nearlymonolith Jun 24 '10 at 17:16

4 Answers


Do you really need to precompute all the possible pairs? What if you were to do it lazily, i.e. on an on-demand basis?

This can be represented as a 2D matrix. The rows correspond to the customers and the columns correspond to the products.

Each entry is either 0 or 1, saying whether the product corresponding to the column was bought by the customer corresponding to the row.

If you look at each column as a vector of (around 5000) 0s and 1s, then the number of times two products were bought together is just the dot product of the corresponding vectors!

Thus you can just compute those vectors first, and then compute the dot product lazily, on demand.

To compute the dot-product:

Now, a good representation of a vector with only 0s and 1s is an array of integers, which is basically a bitmap.

For 5000 entries, you will need an array of 79 64-bit integers.

So given two such arrays, you need to count the number of 1s that are common.

To count the number of bits common to two integers, first you can do the bitwise AND and then count the numbers of 1s that are set in the resulting number.

For this you can use either lookup tables or bit-counting methods (not sure if Python supports them natively), like those here: http://graphics.stanford.edu/~seander/bithacks.html

So your algorithm will be something like this:

  • Initialize an array of 79 64-bit integers for each product.

  • For each customer, look at the products they bought and set that customer's bit in each corresponding product's array.

  • Now, given a query of two products for which you need the number of customers who bought both, just take the dot product as described above.

This should be reasonably fast.
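
To make the idea concrete, here is a minimal Python sketch. It cheats a little relative to the description above: it uses Python's arbitrary-precision ints as the bitmaps (so no explicit 79-word array) and bin(x).count('1') as the popcount; all names are made up for the example:

from collections import defaultdict

customer_bit = {}           # customer -> assigned bit position
bitmaps = defaultdict(int)  # product -> bitmap of customers who bought it

def record(customer, product):
    # set this customer's bit in the product's bitmap
    pos = customer_bit.setdefault(customer, len(customer_bit))
    bitmaps[product] |= 1 << pos

def bought_together(a, b):
    # dot product of two 0/1 vectors == popcount of the bitwise AND
    return bin(bitmaps[a] & bitmaps[b]).count('1')

for cust, item in [(1, 'hammer'), (1, 'nails'), (2, 'hammer'),
                   (2, 'nails'), (4, 'nails')]:
    record(cust, item)

print(bought_together('hammer', 'nails'))   # -> 2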

As a further optimization, you can possibly consider grouping the customers.

  • I can't believe I didn't think to use the dot product myself, especially coming off of another project involving cosine similarity! Thanks. Enjoy your upvote. – Neil Kodner Jun 25 '10 at 18:27
  • However, if I create a 2D matrix, won't the O(N²) cost be high because I'm comparing every item against every other item? One of the reasons I'm looking at Jaccard/Tanimoto is that it saves me from having to compare non-related items. – Neil Kodner Jun 25 '10 at 18:30
  • 1
    @Neil: You are creating a vector for each product. Initializing a vector is O(M) (M is number of customers), but that is really fast for an in-memory bitmap, where you can zero out in blocks. Once you have the initialization done, the processing cost is O(S) where S is the number of 1s, and then O(M) for each query (given two products). Your problem is basically determining the size of set intersection, so depending on the sparsity, it might be better to represent your sets using dicts to represent the set of customers who bought a product. For ~5K customers it really might not matter. –  Jun 25 '10 at 19:35
  • ...continuing... Dicts have a overhead of computing hash keys etc, which bitmaps don't have. So the choice really depends on your data like sparsity etc. Of course, dicts are easier to code up I suppose. btw, a bitmap/vector is just another form of hashing, just like dicts are. –  Jun 25 '10 at 19:38
  • Moron, so I exaggerated a little bit for simplicity's sake - how about tens (maybe hundreds) of thousands of customers and tens of thousands of items? – Neil Kodner Jun 26 '10 at 00:20
  • @Neil. Then you could consider a combination of dicts and bitmaps (the grouping of customers I was talking about in my answer). You group customers into groups G1, G2... and then maintain a dict for each product, keyed on the group. The values stored will be the bitmaps/vectors for the groups in the dict. Using only dicts corresponds to each customer being in his own group. Using just bitmap corresponds to all customers in one group. If you group the customers based on similar tastes, you might get good results. You could even do clever stuff with overlaps between groups etc... –  Jun 26 '10 at 05:08
events = """\
1-hammer 
1-screwdriver 
1-nails 
2-hammer 
2-nails 
3-screws 
3-screwdriver 
4-nails 
4-screws""".splitlines()
events = sorted(map(str.strip,e.split('-')) for e in events)

from collections import defaultdict
from itertools import groupby

# tally each occurrence of each pair of items
summary = defaultdict(int)
for val,items in groupby(events, key=lambda x:x[0]):
    items = sorted(it[1] for it in items)
    for i,item1 in enumerate(items):
        for item2 in items[i+1:]:
            summary[(item1,item2)] += 1
            summary[(item2,item1)] += 1

# now convert raw pair counts into friendlier lookup table
pairmap = defaultdict(dict)
for k,v in summary.items():
    item1, item2 = k
    pairmap[item1][item2] = v

# print the results    
for k,v in sorted(pairmap.items()):
    print k,':',v

Gives:

hammer : {'nails': 2, 'screwdriver': 1}
nails : {'screws': 1, 'hammer': 2, 'screwdriver': 1}
screwdriver : {'screws': 1, 'nails': 1, 'hammer': 1}
screws : {'nails': 1, 'screwdriver': 1}

(This addresses your initial request grouping items by purchase event. To group by user, just change the first key of the events list from event number to user id.)

PaulMcG

Paul's answer might be the best, but here's what I came up with over my lunch break (untested, admittedly, but still a fun exercise in thinking). I'm not sure of the quickness/optimization of my algorithm. I'd personally suggest looking at something like MongoDB, a NoSQL database, since it seems it may lend itself nicely to this kind of problem (what with map/reduce and all).

# assuming events is a list of (cust_id, item) pairs...
users = {}
for (cust_id, item) in events:
    if cust_id not in users:
        users[cust_id] = set()
    users[cust_id].add(item)
# now we have a dictionary of cust_ids keyed to a set of every item
# they've ever bought (given that repeats don't matter)
# now we construct a dict of items keyed to a dictionary of other items
# which are in turn keyed to num times present
items = {}
def insertOrIter(d, k, v):
    if k in d:
        d[k] += v
    else:
        d[k] = v
for key in users:
    # keep track of items bought with each other
    itemsbyuser = []
    for item in users[key]:
        # make sure the item's dict is set up
        if item not in items:
            items[item] = {}
        # as we see each item, count it against the items seen so far, and vice versa
        for other in itemsbyuser:
            insertOrIter(items[other], item, 1)
            insertOrIter(items[item], other, 1)
        itemsbyuser.append(item)
# now, unless I've screwed up my logic, we have a dictionary of items keyed
# to a dictionary of other items keyed to how many times they've been
# bought with the first item. *whew* 
# If you want something more (potentially) useful, we just turn that around to be a
# dictionary of items keyed to a list of tuples of (times seen, other item) and
# you're good to go.
useful = {}
for i in items:
    temp = []
    for other in items[i]:
        temp.append((items[i][other], other))
    useful[i] = sorted(temp, reverse=True)
# Now you should have a dictionary of items keyed to tuples of
# (number times bought with item, other item) sorted in descending order of
# number of times bought together
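
A quick, hypothetical sanity check of the above against the question's sample data (untested like the rest):

events = [('1', 'hammer'), ('1', 'screwdriver'), ('1', 'nails'),
          ('2', 'hammer'), ('2', 'nails'), ('3', 'screws'),
          ('3', 'screwdriver'), ('4', 'nails'), ('4', 'screws')]
# after running the code above on these events:
print(useful['hammer'])   # -> [(2, 'nails'), (1, 'screwdriver')]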
nearlymonolith
  • Embrace the defaultdict! Never again check on the existence of a dict key before updating its value - just access the key and let the defaultdict initialize it if it isn't there. The easiest code to maintain is the code that doesn't exist. – PaulMcG Jun 25 '10 at 04:17

It seems rather strange that every time you want to get the stats, all the solutions above churn through the entire database to get the counts.

I'd suggest keeping the data in flat indexes and getting results only for a specific item, one at a time. If your item count is large, this will be more efficient.

from collections import defaultdict
from itertools import groupby

class myDB:
    '''Example of "indexed" "database" of orders <-> items on order'''
    def __init__(self):
        self.id_based_index = defaultdict(set) 
        self.item_based_index = defaultdict(set)

    def add(self, order_data):
        for id, item in order_data:
            self.id_based_index[id].add(item)
            self.item_based_index[item].add(id)

    def get_compliments(self, item):
        all_items = []
        for id in self.item_based_index[item]:
            all_items.extend(self.id_based_index[id])
        gi = groupby(sorted(all_items), lambda x: x)
        return dict([(k, len(list(g))) for k, g in gi])

Example of using it:

events = """1-hammer 
    1-screwdriver 
    1-nails 
    2-hammer 
    2-nails 
    3-screws 
    3-screwdriver 
    4-nails 
    4-screws"""

db = myDB()
db.add(
    [ map(str.strip,e.split('-')) for e in events.splitlines() ]
    )
# index is incrementally increased 
db.add([['5','plunger'],['5','beer']])

# this scans and counts only needed items
assert db.get_compliments('NotToBeFound') == {}
assert db.get_compliments('hammer') == {'nails': 2, 'hammer': 2, 'screwdriver': 1}
# you get back the count for the requested product as well. Discard if not needed.

This is all fun but, seriously, just go for real database storage. Because indexing is already built into any DB engine, all of the above code in SQL would just be:

select
    p_others.product_name,
    count(1) cnt
from products p
join order_product_map opm
    on p.product_id = opm.product_id
join order_product_map opm_others
    on opm.order_id = opm_others.order_id
join products p_others
    on opm_others.product_id = p_others.product_id
where p.product_name in ('hammer')
group by p_others.product_name
ddotsenko
  • The solution I proposed can be used in the caching layer, which will not require any DB for different queries. In fact, if updates occur you can update the structure (very trivial to do) and then do a delayed write to the DB or whatever. Besides, OP asked for a data structure, and he got it (and not just in my answer, but the others too). Going to the database each time does not scale (especially if you have multiple joins in your queries!). Caching becomes a must. Also, I presume if OP wanted a SQL query, he would have asked for it. – Jun 25 '10 at 02:00
  • Agreed - the solutions posted weren't suggesting that this elaborate process be done every time, but rather that doing said process would get you the data you want (which could then be stored, indexed, put in a database, whatever the case may be). It's a server side operation running intermittently, and pages would just access the data assuming it were already computed. As to the SQL, that's why I suggested MongoDB - I had a similar thought that this code would be nicely accomplished natively on the DB. That's a nice bit of SQL, though (never thought I'd say that). – nearlymonolith Jun 25 '10 at 13:26