
UPDATE 26-May-16: tried new algorithm. See bottom.


I'm looking for suggestions for an algorithm and toolset to get frequency counts of pairs of items. For those familiar with it, this is similar to the "market basket" problem (the "beer and diapers" model), except I need the frequency count of EVERY pair which occurs.

I have about 5 million records. Each record is a list of 10 to 300 items. The items are integers ranging from 1 to approximately 250,000. So, for example:

1: [85708, 28302, 1045, 20395]
2: [20382, 3092, 2933, 20993, 58585, 4855, 112393, 38347, 20447, 33892]
3: [118082, 30282, 2859, 585, 1045, 20395, 2383, 85855, 182582, 223]

I want to generate a table to answer the question:

for any pair of items, how many times do they occur together in the same record?

For example, record 1 generates pairs: (85708, 28302), (85708, 1045), (85708, 20395), (28302, 1045), (28302, 20395) and (1045, 20395). I want to tally up the frequency of each of these pairs in the whole dataset. [Order doesn't matter].

To give an idea of the size it needs to handle: the mean length of the records is 85 items. A record of that length generates 3,570 (= 85*84/2) pairs of items which need to be counted, so 5 million records at that length means roughly 18 billion pairs to tally. In most runs the median record length is much lower than the mean (most records contain <18 items, while a few records contain much more), so the actual number of pairs probably won't reach 18 billion, but it could definitely be a few billion.

The frequency distribution of single items follows a power law, with a few high frequency items and many low frequency items; on the most recent run with a larger-than-normal size, there ended up being around 2 billion distinct pairs of items that have a frequency of >0. The vast majority of potential pair combinations do not occur; each run is different, but I'd estimate that at most 15% of possible pair combinations will occur, and in most cases, less than 2% will occur.

I've got a program which works accurately, but it's very slow. I'd now like to optimize it for speed. It's brute force using Python and MySQL:

  • In Python, fetch the items for a batch of 1,000 records.
  • Loop through the batch record by record and, using Python's itertools.combinations, generate all the pair combinations of each record's items.
  • Store the results in a SQL db. I have a table in the db with 3 fields: item1 (int), item2 (int), frequency (int), with primary key (item1, item2). For each pair combination we calculated, do an INSERT ... ON DUPLICATE KEY UPDATE: i.e., if the pair doesn't exist in the table, insert it with a frequency of 1; if it does exist, increment its frequency by 1.
  • Repeat the loop for the next batch of 1,000 records. (A sketch of this loop follows.)
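
A minimal sketch of that loop (not my exact code): pair_counts is a placeholder name for the table described above, get_record_batches() stands in for however the batch of records is fetched, and I'm showing mysql.connector, though any DB-API driver looks about the same:

    import itertools
    import mysql.connector  # any DB-API driver works much the same way

    conn = mysql.connector.connect(user="user", password="pw", database="pairs_db")
    cur = conn.cursor()

    UPSERT = (
        "INSERT INTO pair_counts (item1, item2, frequency) VALUES (%s, %s, 1) "
        "ON DUPLICATE KEY UPDATE frequency = frequency + 1"
    )

    for batch in get_record_batches(size=1000):   # placeholder: yields lists of records
        rows = []
        for record in batch:                      # record is a list of item ids
            items = sorted(set(record))           # normalize so item1 < item2 in every pair
            rows.extend(itertools.combinations(items, 2))
        cur.executemany(UPSERT, rows)             # one round trip per batch of records
        conn.commit()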

It took around 15 hours to process. When I wrote this a while ago, the time didn't matter; I just needed to run it once to get static results that never had to be updated. But now the input records will be changing, and I need to optimize so that I can re-generate results at least once a day. The results need to be in a form which allows very fast lookups of the frequency of an item-item pair; I was thinking of something like an indexed db table.

I made some basic changes to the brute-force program to improve efficiency, mainly by playing with the number of read-write batches; the bulk of the processing time is spent in the "insert the pair if it doesn't exist / increment the pair's frequency count if it does" step. These little tweaks shaved around 15% off the processing time.

Another possible tweak: since I already have the frequency of each single item, I could "pre-seed" the database with the most likely frequent combinations (say, the top 5,000 x 5,000), then in Python split the pair combinations I find into two groups based on their item numbers: "definitely in the db" and "don't know if it's in the db." That would save the db some work, but at the cost of having Python keep track of the frequent items and do the splitting...

So I could keep making tweaks like that and save a few more percent here and there, but I'd rather do this right: re-write the procedure from scratch with a good algorithm and good tools, instead of wasting time tweaking a bad process that was quickly cobbled together for a one-off use and was never designed for efficiency.

It has to run on a user's single stand-alone desktop (standard specs), no outside storage or distributed computing.

Ideally, I'd like to run the process from Python. NumPy, SciPy, and BLAS/LAPACK are all OK. (I looked at Python's collections.Counter per this answer to a related question, but I think my data is too big for it; tell me if that's wrong and Counter could be effective.)
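
For what it's worth, the Counter version would be trivial (records below is just an iterable of item-id lists, not a real variable from my code). The problem is memory: a dict entry costs very roughly on the order of 100 bytes per distinct pair once you count the key tuple, so a couple of billion distinct pairs would need hundreds of GB.

    from collections import Counter
    from itertools import combinations

    pair_counts = Counter()
    for record in records:  # records: an iterable of item-id lists (assumed)
        pair_counts.update(combinations(sorted(set(record)), 2))

    # pair_counts[(1045, 20395)] -> frequency of that pair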

My problem is similar to the market basket problem, which originally comes from a store recording the items a customer buys in one basket (and led to the famous conclusion that people who buy diapers are unusually likely to buy beer) [thanks to @lrnzcig for the link to this good description of the market basket problem]. Strategies for the market basket problem filter item pairs down to the most frequent pairs and don't count anything that doesn't fit in main memory. But in my case, I need to count EVERY pair which occurs, even if it occurs just one time. So I need an algorithm and toolset to efficiently store and index all of this. I don't want to reinvent the wheel, and I'd really like to find a solution which can handle this efficiently.

What would you recommend as the best solution?


UPDATE (26-May-16): I developed a solution which accurately tallies a full dataset of a few billion pairs in 2 hours. Basic idea:

  • Take advantage of the power law distribution and the fact that I already have calculated the frequency of the single items. Pairs consisting of the top few thousand items represent a large percent of the total pairs.
  • Build a 1-dimensional array to hold the counts of the most frequent pairs. Half the cells of an i x j matrix would be wasted since the order of the pair doesn't matter [(a,b) counts the same as (b,a)], so I save space by packing each pair into a single k-id (converting (i,j) into a k-index into the upper triangle of the i x j matrix; see the first sketch after this list). I dynamically size the array based on the frequency distribution of the single items and the available memory. I've found that 3,000 x 5,000 (stored in an array of 10.5 million ids) works well.
  • I built the array in native Python. Similar to this answer, I found that for the simple array access and counter increments I'm doing, native Python takes much more memory than numpy but is much faster.
  • Process each record. For each pair, if both items are in the most-frequent group, increment the counter at its id in the array. If not, add the pair to a list of low-frequency pairs.
  • When memory gets tight, sort the list of low-frequency pairs and write it out to a new file.
  • At the end of processing, do a heapq merge of the (many) sorted files into one file containing all the low-frequency pairs, then walk through it and get the count for each unique pair (see the second sketch after this list). Finally, translate the high-frequency array into pair-count values and merge it with the low-frequency counts. The result is a file of pair-frequency entries, in sorted order.
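
A sketch of the upper-triangle packing for the high-frequency counters, simplified to the square n x n case just to show the idea (the 3,000 x 5,000 rectangle I actually use only changes the index arithmetic; names like pair_to_k and rank are illustrative, not my real code):

    N = 5000                      # number of "frequent" items kept in the in-memory table
    N_SLOTS = N * (N - 1) // 2    # one counter per unordered pair of frequent items (i < j)
    counts = [0] * N_SLOTS        # plain Python list: more memory than numpy, but faster to increment

    def pair_to_k(i, j, n=N):
        """Map dense ranks i < j (both < n) to a single index in the packed upper triangle."""
        return i * n - i * (i + 1) // 2 + (j - i - 1)

    # rank: dict mapping the top-N item ids to dense ranks 0..N-1, built from the
    # single-item frequencies I already have.
    def tally(a, b, rank, low_freq_spill):
        ra, rb = rank.get(a), rank.get(b)
        if ra is not None and rb is not None:
            i, j = (ra, rb) if ra < rb else (rb, ra)
            counts[pair_to_k(i, j)] += 1                         # high-frequency pair
        else:
            low_freq_spill.append((a, b) if a < b else (b, a))   # spill for the sort/merge step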
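
And a sketch of the final merge of the sorted spill files (again, illustrative names; it assumes each spill file holds one "item1 item2" line per pair and is sorted in plain string order, so identical lines end up adjacent in the merged stream):

    import heapq
    from itertools import groupby

    def merge_spill_files(paths, out_path):
        """Merge many sorted spill files of 'item1 item2' lines into one 'item1 item2 count' file."""
        files = [open(p) for p in paths]
        try:
            with open(out_path, "w") as out:
                merged = heapq.merge(*files)       # streams the files; nothing is loaded whole
                for line, group in groupby(merged):
                    out.write("%s %d\n" % (line.rstrip("\n"), sum(1 for _ in group)))
        finally:
            for f in files:
                f.close()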

It depends a lot on maxing out system memory; I monitor memory use throughout to grab as much as possible. The bottleneck is the disk reads/writes: merging hundreds of huge files is much more I/O-intensive than I'd thought it would be. So I've played with the settings to reduce the number of files: merging a few massive files is better than merging many smaller ones.

With 4 GB of RAM, it takes a little under 2 hours to process the most recent batch of 5 million records, which contained a few billion pairs. That's definitely better than my initial 15 hours, but it feels pretty hacky and I'm sure there must be better ways to just count pairs. Please let me know if you have any ideas.

SarapoonJ
  • I don't think you'll find anything "out of the box" which does exactly what you need. However, it is quite similar to the algorithm for `market basket analysis`, and there's a good description of an efficient implementation in the book [mining of massive datasets](http://www.mmds.org/#ver21), chapter 6. Maybe it helps. – lrnzcig May 24 '16 at 10:19
  • @lrnzcig: Yes, it is a market basket analysis (the "beer and diapers" model haha). But their methods filter to focus on most frequent pairs and require it fit in main memory; in my case, I need the count for EVERY pair which occurs, even if only one time. So I need an efficient algorithm and a toolset for storing and indexing the item pairs. Edited the question to be clear on this, thanks for the good comment. – SarapoonJ May 24 '16 at 12:21
  • Did you try to simply do the counting with a SQL join + groupBy? I mean, bulk-load the (integer, record) pairs. This may not fit what you need to do afterwards, but doing the count only at query time in an SQL query instead of a simple SQL lookup may be acceptable. Storing all the pairs is huge, so the tradeoff may be good in your scenario. –  May 24 '16 at 13:04
  • @Blabla404: Counting at query time won't work. Has to be pre-counted and quick to do lookups... As for join + groupBy on billions of records in SQL? Unordered, unindexed, no, that doesn't work. Having an index first would obviously be better, but building that index and then doing the massive group by takes longer than just doing the count myself. – SarapoonJ May 26 '16 at 03:38
  • If you really want all vs. all frequencies (I'm not convinced you do, incidentally -- no one has time to query more than a small fraction of them!) then I think all solutions are somewhat hacky. And bear in mind that you can't possibly do better than the amount of time it takes to write out a 180 GB file to disk (assuming each pair and its frequency takes on average 10 bytes to store). – j_random_hacker May 26 '16 at 10:43

1 Answer


You can print out all pairs of distinct elements for each record, then leverage the well-engineered sort command available on any Unix system to group identical pairs together, and finally count the number of lines in each identical block with uniq -c:

perl -lne '($_) = /\[(.*)\]/ or die; @x = sort { $a <=> $b } split /, /; for ($i = 0; $i < @x - 1; ++$i) { for ($j = $i + 1; $j < @x; ++$j) { print "$x[$i] $x[$j]"; } }' | sort -g | uniq -c > outfile

This will take a long time for 18 billion rows, but it should be faster than repeatedly updating a B*-tree, which is what an SQL database is very likely doing internally. (To put it another way: If updating a B*-tree actually was faster than this, then all implementations of sort would internally do that instead, too.) You will have to try it out and see.

To query this "database", you can just binary-search outfile -- no need to load the whole thing into memory. (You might want to turn it into some more compact binary format first, but this isn't actually necessary -- you can still perform binary search on the plain text file just by always reading forward till you hit a \n after each seek. Once the range you're searching gets small enough, you might want to read it into memory in its entirety and continue binary-searching in-memory.)
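
Here's a sketch of that lookup in Python. The outfile lines look like "<count> <item1> <item2>"; the key() below has to reproduce whatever order sort actually produced, and it assumes plain numeric order on the two item columns (which you'd get by sorting with sort -k1,1n -k2,2n), so adjust it to match your file:

    import os

    def lookup_pair(path, item1, item2, window=1 << 16):
        """Binary-search the sorted outfile for '<count> <item1> <item2>' and return the count."""
        target = (item1, item2)

        def key(line):                 # must match the order the file was sorted in
            _count, a, b = line.split()
            return (int(a), int(b))

        lo, hi = 0, os.path.getsize(path)
        with open(path, "rb") as f:
            # Narrow the byte range by seeking to the middle and resyncing to the next newline.
            while hi - lo > window:
                mid = (lo + hi) // 2
                f.seek(mid)
                f.readline()                             # skip the partial line we landed in
                line = f.readline()
                if not line or key(line.decode()) >= target:
                    hi = mid                             # target line starts at or before mid
                else:
                    lo = mid                             # target line starts after mid
            # The remaining window is small enough to scan line by line.
            f.seek(lo)
            if lo:
                f.readline()                             # resync to the next full line
            while True:
                raw = f.readline()
                if not raw:
                    break
                line = raw.decode()
                k = key(line)
                if k == target:
                    return int(line.split()[0])          # the count written by uniq -c
                if k > target or f.tell() > hi + window:
                    break
        return None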

If you don't care for Perl, I'm sure you could code that first part in Python or any other language.
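
For instance, here is a rough Python 3 equivalent of the Perl stage, reading the same "id: [item, item, ...]" format from stdin:

    import re
    import sys
    from itertools import combinations

    for line in sys.stdin:
        m = re.search(r"\[(.*)\]", line)
        if not m:
            raise ValueError("unexpected line: %r" % line)
        items = sorted(int(x) for x in m.group(1).split(","))
        for a, b in combinations(items, 2):
            print(a, b)

Save it as a script and pipe its output into the same sort -g | uniq -c pipeline as above.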

j_random_hacker
  • Interesting idea, thanks. The query of the "database" wouldn't work well for what I want: I need it to be fast, and, once built and indexed, the sql lookups and relationships are very quick in linking the results back to the underlying data.... But your idea of using sort on a massive data dump is interesting. I played around with it: hits up on memory and disk issues. I updated my question. Thanks. – SarapoonJ May 26 '16 at 03:44
  • "Hits up" = ? Ideally the `sort` command would have its own option to merge *and count* duplicate rows, as this would eliminate large numbers of lines in the later merges, but unfortunately that seems not to be the case. It is optimised to use as much RAM as it thinks it can get away with though (I think this can be adjusted through options). – j_random_hacker May 26 '16 at 10:34
  • BTW, the B*-tree lookups your DB is doing are essentially just k-ary tree lookups -- that is, binary searches, but with each disk block read, say, 1 of 2048 subranges (instead of 1 of 2 subranges) will be chosen. This will actually only be about log2(2048/2) = 10 times faster than the binary search. – j_random_hacker May 26 '16 at 10:37