Came across this interview programming test recently:
- You're given a list of the top 50 favorite artists for each of 1000 users (from last.fm)
- Generate a list of all artist pairs that appear together at least 50 times.
- The solution can't store all possible pairs in memory, or evaluate every possible pair.
- The solution should be scalable to larger datasets.
- The solution doesn't have to be exact, i.e. you can report pairs that have a high probability of meeting the cutoff.
I feel I have a pretty workable solution, but I'm wondering if they were looking for something specific that I missed.
(In case it makes a difference - this isn't from my own interviewing, so I'm not trying to cheat any prospective employers)
Here are my assumptions:
- There's a finite maximum number of artists (622K according to MusicBrainz), while there is no limit on the number of users (well, not more than ~7 billion, I guess).
- Artists follow a "long tail" distribution: a few are popular, but most are favorited by a very small number of users.
- The cutoff is chosen to select a certain percentage of artists (around 1% with a cutoff of 50 and the given data), so it will increase as the number of users increases.
The third requirement is a little vague - technically, if you have an exact solution, you've "evaluated all possible pairs".
Practical Solution
First pass: convert artist names to numeric ids; store the converted favorites data in a temp file; keep a count of how many users favorited each artist.
Requires a string->int map to keep track of assigned ids; a Patricia tree can be used if space is more important than speed (it needed 1/5th the space and twice the time in my, admittedly not very rigorous, tests).
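A minimal sketch of this pass, assuming one input line per user with tab-separated artist names (the class name and on-disk formats are made up for illustration; adjust to the real last.fm dump):

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// First pass: assign numeric ids, spill the id-converted favorites to a
// temp file, and tally how many users favorited each artist.
public class FirstPass {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> ids = new HashMap<>(); // a Patricia tree could replace this to trade speed for space
        List<Integer> counts = new ArrayList<>();   // counts.get(id) = number of users favoriting artist `id`
        Path temp = Files.createTempFile("favorites", ".ids");

        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]));
             BufferedWriter out = Files.newBufferedWriter(temp)) {
            String line;
            while ((line = in.readLine()) != null) {
                StringBuilder row = new StringBuilder();
                for (String artist : line.split("\t")) {
                    // New artists get the next free id (= current map size).
                    int id = ids.computeIfAbsent(artist, a -> ids.size());
                    if (id == counts.size()) counts.add(0);
                    counts.set(id, counts.get(id) + 1);
                    row.append(id).append(' ');
                }
                out.write(row.toString().trim());
                out.newLine();
            }
        }
        // Persist the per-artist tallies for the second pass
        // (one count per line, line number == artist id).
        try (BufferedWriter cw = Files.newBufferedWriter(Paths.get("artist_counts.txt"))) {
            for (int c : counts) { cw.write(Integer.toString(c)); cw.newLine(); }
        }
        System.out.println(ids.size() + " distinct artists; ids written to " + temp);
    }
}
```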
Second pass: iterate over the temp file; throw out artists which didn't, individually, meet the cutoff; keep counts of the remaining pairs in a triangular matrix.
Will require n(n-1)/2 bytes (or shorts, or ints, depending on the data size) plus the array reference overhead. Shouldn't be a problem since n is, at most, 0.01-0.05 of 622K: at n = 6,220 (1% of 622K), that's about 19 million byte-sized counts, i.e. roughly 19MB.
This seems like it could process a real-world dataset of any size in less than 100MB of memory.
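A matching sketch of the second pass, under the same assumed file formats; row i of the triangular matrix holds the counts for pairs (i, j) with j > i:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

// Second pass: drop artists below the cutoff, remap survivors to a compact
// 0..n-1 range, and count co-occurrences in a triangular matrix.
public class SecondPass {
    static final int CUTOFF = 50;

    public static void main(String[] args) throws IOException {
        // args[0] = temp file of ids from the first pass,
        // args[1] = per-artist counts (one per line, line number == id).
        List<String> countLines = Files.readAllLines(Paths.get(args[1]));
        int[] remap = new int[countLines.size()];
        Arrays.fill(remap, -1);
        int n = 0;
        for (int id = 0; id < remap.length; id++)
            if (Integer.parseInt(countLines.get(id)) >= CUTOFF) remap[id] = n++;

        // Triangular pair counts: n(n-1)/2 entries total. Ints here; bytes
        // or shorts with a saturation check would shrink the footprint.
        int[][] pairs = new int[n][];
        for (int i = 0; i < n; i++) pairs[i] = new int[n - 1 - i];

        try (BufferedReader in = Files.newBufferedReader(Paths.get(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                if (line.isEmpty()) continue;
                // Keep only this user's artists that survived the cutoff.
                int[] kept = Arrays.stream(line.split(" "))
                        .mapToInt(Integer::parseInt)
                        .map(id -> remap[id])
                        .filter(x -> x >= 0)
                        .sorted()
                        .toArray();
                for (int a = 0; a < kept.length; a++)
                    for (int b = a + 1; b < kept.length; b++)
                        pairs[kept[a]][kept[b] - kept[a] - 1]++;
            }
        }
        // Emit pairs that met the cutoff (as compact ids).
        for (int i = 0; i < n; i++)
            for (int j = 0; j < pairs[i].length; j++)
                if (pairs[i][j] >= CUTOFF)
                    System.out.println(i + "\t" + (i + j + 1) + "\t" + pairs[i][j]);
    }
}
```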
Alternate Solution
If you can't do multiple passes (for whatever contrived reason), use an array of Bloom filters to keep the pair counts: for each pair you encounter, find the highest filter it's (probably) already in, and add it to the next one up. So the first time a pair is seen it's added to bf[0], the second time to bf[1], and so on up to bf[49]. Or you can revert to keeping actual counts after a certain point.
I haven't run the numbers, but the lowest few filters will be quite sizable - it's not my favorite solution, but it could work.
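A rough sketch of the cascade, with a hand-rolled Bloom filter; the filter sizes, hash mixing, and uniform per-level sizing are all placeholder choices (real sizing would make the lower levels much larger, per the long-tail assumption):

```java
import java.util.BitSet;

// Cascade of Bloom filters acting as a probabilistic counter for pair keys:
// a pair seen k times is (probably) present in levels[0..k-1].
public class BloomCascade {
    static class Bloom {
        final BitSet bits;
        final int size, hashes;
        Bloom(int size, int hashes) { this.bits = new BitSet(size); this.size = size; this.hashes = hashes; }
        // Derive the i-th bit index from one 64-bit key via multiply-xor mixing.
        private int index(long key, int i) {
            long h = key * 0x9E3779B97F4A7C15L + i * 0xC2B2AE3D27D4EB4FL;
            h ^= h >>> 32;
            return (int) ((h & Long.MAX_VALUE) % size);
        }
        boolean mightContain(long key) {
            for (int i = 0; i < hashes; i++) if (!bits.get(index(key, i))) return false;
            return true;
        }
        void put(long key) {
            for (int i = 0; i < hashes; i++) bits.set(index(key, i));
        }
    }

    final Bloom[] levels;

    BloomCascade(int cutoff, int bitsPerLevel, int hashes) {
        levels = new Bloom[cutoff];
        for (int i = 0; i < cutoff; i++) levels[i] = new Bloom(bitsPerLevel, hashes);
    }

    // Record one occurrence of a pair; returns true once the pair has
    // (probably) been seen `cutoff` times.
    boolean add(int artistA, int artistB) {
        // Canonical 64-bit key: smaller id in the high half.
        long key = ((long) Math.min(artistA, artistB) << 32) | Math.max(artistA, artistB);
        for (Bloom level : levels) {
            if (!level.mightContain(key)) { level.put(key); return false; }
        }
        return true; // already in every level: met the cutoff
    }
}
```

Note the error is one-sided: a false positive lets a pair skip a level, inflating its apparent count, so this over-reports rather than misses pairs - which fits the "high probability of meeting the cutoff" allowance.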
Any other ideas?