UPDATE 26-May-16: tried new algorithm. See bottom.
I'm looking for suggestions for an algorithm and toolset to get frequency counts of pairs of items. For those familiar with it, this is similar to the "market basket" problem (the "beer and diapers" model), except I need the frequency count of EVERY pair which occurs.
I have about 5 million records. Each record is a list of 10 to 300 items. The items are integers ranging from 1 to approximately 250,000. So, for example:
1: [85708, 28302, 1045, 20395]
2: [20382, 3092, 2933, 20993, 58585, 4855, 112393, 38347, 20447, 33892]
3: [118082, 30282, 2859, 585, 1045, 20395, 2383, 85855, 182582, 223]
I want to generate a table to answer the question:
for any pair of 2 items, how many times do they occur in the same record?
For example, record 1 generates pairs: (85708, 28302), (85708, 1045), (85708, 20395), (28302, 1045), (28302, 20395) and (1045, 20395). I want to tally up the frequency of each of these pairs in the whole dataset. [Order doesn't matter].
To give an idea of the size it needs to handle: the mean length of the records is 85 items. A record of that length produces 3,570 (= 85*84/2) pairs of items which need to be counted. For 5 million records at that length, that's roughly 18 billion pairs of items to be tallied. In most runs, the median length of records is much lower than the mean (most records contain <18 items, while a few contain far more), so the actual number of pairs probably won't reach 18 billion, but it could definitely be a few billion.
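Just to make those figures concrete, this is nothing more than the standard n*(n-1)/2 count (not specific to my data):

```python
def pairs_per_record(n):
    # number of unordered pairs among n distinct items
    return n * (n - 1) // 2

print(pairs_per_record(85))               # 3570 pairs for a mean-length record
print(5 * 10**6 * pairs_per_record(85))   # 17,850,000,000 -- roughly 18 billion total
```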
The frequency distribution of single items follows a power law, with a few high frequency items and many low frequency items; on the most recent run with a larger-than-normal size, there ended up being around 2 billion distinct pairs of items that have a frequency of >0. The vast majority of potential pair combinations do not occur; each run is different, but I'd estimate that at most 15% of possible pair combinations will occur, and in most cases, less than 2% will occur.
I've got a program which works accurately, but it's very slow. I'd now like to optimize it for speed. It's brute force using Python and MySQL:
- In Python, get the items for a batch of 1,000 records.
- Using Python's `itertools.combinations`, loop through the batch record by record and generate all the pair combinations of each record's items.
- Store the results in a SQL db. I have a table in the db with 3 fields: `item1 (int), item2 (int), frequency (int), primary key (item1, item2)`. For each pair combination we calculated, do an `INSERT ... ON DUPLICATE KEY UPDATE`: i.e., if the pair doesn't exist in this table, insert it with a frequency of 1; if it does exist, increase its frequency by 1. (See the sketch after this list.)
- Repeat the loop for the next batch of 1,000 records.
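For reference, the core of that loop looks roughly like this. It's a minimal sketch rather than my exact code; the `pair_counts` table name, the placeholder credentials, and the choice of `mysql.connector` as the driver are stand-ins for whatever your setup uses:

```python
import itertools
import mysql.connector  # assumption: any MySQL driver with executemany() works the same way

conn = mysql.connector.connect(user="user", password="pw", database="pairs_db")  # placeholders
cur = conn.cursor()

UPSERT_SQL = (
    "INSERT INTO pair_counts (item1, item2, frequency) VALUES (%s, %s, 1) "
    "ON DUPLICATE KEY UPDATE frequency = frequency + 1"
)

def process_batch(records):
    """records: list of item lists, e.g. [[85708, 28302, 1045, 20395], ...]"""
    rows = []
    for items in records:
        # sort so (a, b) and (b, a) always map to the same (item1, item2) key
        for a, b in itertools.combinations(sorted(items), 2):
            rows.append((a, b))
    cur.executemany(UPSERT_SQL, rows)  # one upsert per pair -- this is where the hours go
    conn.commit()
```

The per-pair upsert is the expensive part: even with `executemany` batching the round trips, MySQL still does an index lookup and a write for every single pair.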
It took around 15 hours to process. When I wrote this a while ago, the time didn't matter; I just needed to run it once to get static results that never had to be updated. But now the input records will be changing, and I need to optimize so that I can re-generate the results at least once a day. The results need to be in a form that allows very fast lookups of the frequency of any item-item pair; I'm thinking of something like an indexed db table.
I made some basic changes to my brute-force program to improve efficiency by playing with the number of read-write batches; the bulk of the processing time is spent in the "insert the pair if it doesn't exist / increment the pair's frequency if it does" step. Those little tweaks shaved the processing time down by around 15%.
Another possible tweak: since I already have the frequency of each single item, I could "pre-seed" the database with the most likely frequent combinations (say, the top 5,000 x 5,000), then have Python split the pair combinations it finds into two groups based on their item numbers: "definitely in the db" and "don't know if it's in the db" (see the sketch below). It would save some time for the db, but at the cost of having Python keep track of the frequent items and do the splitting...
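The split itself would be cheap. A sketch of the idea, assuming `top_items` is the list of item ids sorted by descending single-item frequency and the top-5,000 pairs have already been seeded into the table:

```python
# Hypothetical: item ids whose single-item frequency puts them in the top 5,000.
frequent_items = set(top_items[:5000])

def split_pairs(pairs):
    """Split pairs into those guaranteed to be pre-seeded in the db and those that may not be."""
    definitely_in_db, unknown = [], []
    for a, b in pairs:
        if a in frequent_items and b in frequent_items:
            definitely_in_db.append((a, b))  # can use a plain UPDATE ... frequency = frequency + 1
        else:
            unknown.append((a, b))           # still needs INSERT ... ON DUPLICATE KEY UPDATE
    return definitely_in_db, unknown
```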
So I could keep doing tweaks like that and save a few more percent here and there, but I'd like to do this right and re-write the procedure from scratch now with a good algorithm and good tools rather than waste time tweaking a bad process which was just quickly cobbled together for a one-off use and was never planned for efficiency.
It has to run on a user's single stand-alone desktop (standard specs), no outside storage or distributed computing.
Ideally, I'd like to run the process from Python. NumPy, SciPy, BLAS/LAPACK are all OK. (I looked at Python's `collections.Counter` per this answer to a related question, but I think the size I have is too big; tell me if that's wrong and `Counter` could be effective.)
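For what it's worth, by the `Counter` approach I mean nothing more than this (a sketch; my worry is simply that a dict holding a few billion distinct pair keys won't fit in memory on a standard desktop):

```python
import itertools
from collections import Counter

pair_counts = Counter()

def count_record(items):
    # sorting the items makes (a, b) and (b, a) count under the same key
    pair_counts.update(itertools.combinations(sorted(items), 2))

count_record([85708, 28302, 1045, 20395])   # record 1 from the example above
print(pair_counts[(1045, 20395)])           # -> 1
```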
My problem is similar to the market basket problem, which originally comes from a store recording the items a customer buys in one basket (and which led to the famous conclusion that people who buy diapers are unusually likely to buy beer) [thanks to @lzcig for the link to this good description of the market basket problem]. Strategies for the market basket problem filter item pairs down to the most frequent pairs and don't count anything that doesn't fit in main memory. But in my case, I need to count EVERY pair which occurs, even if it occurs just once. So I need an algorithm and toolset to efficiently store and index all of this. I don't want to reinvent the wheel, and I'd really like to find a solution that can handle this efficiently.
What would you recommend as the best solution?
UPDATE (26-May-16): I developed a solution which accurately tallies a full dataset of a few billion pairs in 2 hours. Basic idea:
- Take advantage of the power law distribution and the fact that I already have calculated the frequency of the single items. Pairs consisting of the top few thousand items represent a large percent of the total pairs.
- Build a 1-dimensional array to hold the counts of the most frequent pairs. Half the values of an i x j matrix would be wasted since the order of the pair doesn't matter [(a,b) counts the same as (b,a)], so I save space by packing each (i,j) into a single k-index into the upper triangle of the i x j matrix (see the sketch after this list). I dynamically size the array based on the frequency distribution of the single items and the available memory; I've found that 3,000 x 5,000 (stored in an array of 10.5 million ids) works well.
- I built the array with a native Python array. Similar to this answer, I found that for the simple array access and counter increments I'm doing, native Python takes much more memory than numpy but is much faster.
- Process each record. For each pair, if the items are in the most frequent group, then increment the counter of its id in the array. If not, add the pair to a list of the low-frequency pairs.
- When memory gets tight, sort the list of low-frequency pairs and write it out to a new file.
- At the end of processing, do a `heapq.merge` of the (many) sorted files to get one sorted stream of all the low-frequency pairs, and walk through it tallying a count for each unique pair. Then finally, translate the high-frequency array into pair-count values and merge that with the low-frequency counts. The result is a file of pair-frequency entries, in sorted order. (A sketch of the core pieces follows this list.)
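To make that concrete, here is a stripped-down sketch of the core pieces: the triangular k-index, the count-or-spill loop, and the `heapq.merge` of the spill files. It's an illustration under assumptions rather than my production code; `rank` (item id -> 0-based rank by single-item frequency), the `spill_dir` directory, the spill threshold, and the plain-text spill format are all stand-ins:

```python
import heapq
import itertools
import os

# High-frequency region: the first member of a pair must rank in the top NUM_I
# items and the second in the top NUM_J (the 3,000 x 5,000 mentioned above).
NUM_I, NUM_J = 3000, 5000

# Size of the packed upper triangle: 3000*4999 - 3000*2999/2 = 10,498,500 slots.
ARRAY_SIZE = NUM_I * (NUM_J - 1) - NUM_I * (NUM_I - 1) // 2
hi_counts = [0] * ARRAY_SIZE

def k_index(i, j):
    """Pack ranks i < j (with i < NUM_I, j < NUM_J) into a single upper-triangle index."""
    return i * (NUM_J - 1) - i * (i - 1) // 2 + (j - i - 1)

low_freq = []       # in-memory buffer of low-frequency (item_a, item_b) pairs
spill_files = []    # sorted run files written when the buffer fills up

def spill():
    """Sort the buffered low-frequency pairs and write them out as a new run file."""
    low_freq.sort()
    path = os.path.join("spill_dir", "run_%d.txt" % len(spill_files))
    with open(path, "w") as f:
        for a, b in low_freq:
            # zero-pad so that lexicographic order of lines == numeric order of pairs
            f.write("%06d %06d\n" % (a, b))
    spill_files.append(path)
    del low_freq[:]

def process_record(items, rank):
    for a, b in itertools.combinations(sorted(items), 2):
        ra, rb = rank.get(a, NUM_J), rank.get(b, NUM_J)   # NUM_J == "not ranked high enough"
        i, j = min(ra, rb), max(ra, rb)
        if i < NUM_I and j < NUM_J:
            hi_counts[k_index(i, j)] += 1    # cheap in-memory increment
        else:
            low_freq.append((a, b))
            if len(low_freq) >= 10 * 1000 * 1000:   # spill threshold: tune to available RAM
                spill()

def merge_low_freq(out_path):
    """heapq.merge the sorted runs and collapse runs of identical pairs into counts."""
    streams = [open(p) for p in spill_files]
    with open(out_path, "w") as out:
        prev, count = None, 0
        for line in heapq.merge(*streams):   # identical pairs arrive adjacent to each other
            if line == prev:
                count += 1
            else:
                if prev is not None:
                    out.write("%s %d\n" % (prev.strip(), count))
                prev, count = line, 1
        if prev is not None:
            out.write("%s %d\n" % (prev.strip(), count))
    for s in streams:
        s.close()

# After the last record, call spill() once more for whatever remains in the buffer.
# The final step (not shown) walks hi_counts, converts each k back to (i, j) ranks and
# then to item ids, and merges those counts with the low-frequency output file.
```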
The approach depends a lot on maxing out system memory; I monitor memory use throughout to use as much of it as I can. The bottleneck is disk reads/writes: merging hundreds of huge files is much more expensive than I'd expected. So I've played around with the settings to reduce the number of files; merging a few massive files is better than merging many smaller ones.
On 4 GB of RAM, it takes a little under 2 hours to process the most recent batch of 5 million records, which had a few billion pairs. That's definitely better than my initial 15 hours, but it feels pretty hacky, and I'm sure there must be better ways to just count pairs. Please let me know if you have any ideas.