
Problem:

I have a list of millions of transactions. Each transaction contains items (e.g. 'carrots', 'apples'). The goal is to generate a list of pairs of items that frequently occur together in individual transactions. As far as I can tell, an exhaustive search isn't feasible.

Solution attempts:

So far I have two ideas: 1) randomly sample an appropriate fraction of the transactions and only check those, or 2) count how often each item appears, use that data to calculate how often items should appear together by chance, and use that to correct the estimate from 1).
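
Here is roughly what idea 1 might look like (a rough sketch in Python; `transactions`, `sample_size` and the other names are placeholders, not working code I already have):

import random
from collections import Counter
from itertools import combinations

# Rough sketch of idea 1: estimate pair frequencies from a random sample.
def sample_pair_counts(transactions, sample_size=100000, seed=0):
    rng = random.Random(seed)
    sample = rng.sample(transactions, min(sample_size, len(transactions)))
    counts = Counter()
    for items in sample:
        # sorted(set(...)) makes ('apples','carrots') and ('carrots','apples') the same pair
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts  # counts.most_common(n) then gives the n most frequent pairs in the sample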

Any tips, alternative approaches, ready-made solutions or just general reading suggestions are much appreciated.

Edit:

Some additional information from the comments

Number of different items: 1,000 to 100,000

Memory constraint: A few gigs of RAM at most, for a few hours.

Frequency of use: More or less a one-off.

Available resources: 20-100 hours of newbie programmer time.

Desired result list format: Pairs of items and some measure of how often they appear, for the n most frequent pairs.

Distribution of items per transaction: Unknown as of now.

RalleG
  • How many items do you have? Are they again on the order of millions? – Ivaylo Strandjev Jan 04 '13 at 14:17
  • Definitely an interesting question. I believe both approaches you propose are good suggestions, and it's worth trying them. – Ivaylo Strandjev Jan 04 '13 at 14:31
  • What is the memory constraint? How often do you need the list re-generated? What resources can you devote to this task? How many different items are there in your universe? What part of the list of pairs do you want? What is the distribution of the number of items per transaction? – Deer Hunter Jan 04 '13 at 14:50
  • Memory and frequency: should be run more or less as a one-off on an ordinary workstation, so something like a few gigs of RAM would be acceptable. – RalleG Jan 04 '13 at 15:00
  • Resources: Basically me (a first-year physics student with one course of programming) putting in 20-100 hours. – RalleG Jan 04 '13 at 15:01
  • Number of items in universe: No real constraints as far as I'm aware. – RalleG Jan 04 '13 at 15:02
  • Desired list format: Pair of items and some measure how often they appear, for the n most frequent pairs. – RalleG Jan 04 '13 at 15:03
  • Distribution of items per transactions: Unknown as of now. – RalleG Jan 04 '13 at 15:03
  • RalleG, this is a very parallelizable task (even in the map/reduce sense); my first approach would be brute-forcing the data set by a) assigning natural numbers to all the item varieties, b) generating the list of pairs from all transactions with the smaller number coming first, and c) running `sort | uniq -c` on the result. – Deer Hunter Jan 04 '13 at 15:22
  • Try searching for frequent item set mining. – ElKamina Jan 04 '13 at 16:02
  • Do you know anything about the "coverage" of all possible pairs? If you expect most of the possible pairs to have at least one occurrence, it's going to be hard to do in less than 10 GB of RAM, but if you expect only a smaller subset, you can get away with a tighter data structure. – Xavier Holt Jan 04 '13 at 16:03
  • Random samples work very well; just consider intra-day predictions for any kind of election. They're usually within 0.1-0.2% of the final result. Assuming that elections are not all fake to begin with (a valid hypothesis, though), the accuracy of a few thousand random samples is stunning. With that in mind, I wouldn't waste much thought on finding a better solution than pulling a good number of random samples from the db. – Damon Jan 04 '13 at 16:19

2 Answers


Let the number of transactions be n, the number of items be k, and the average size of a transaction be d.

The naive approach (checking every pair against all records) gives an O(k^2 * n * d) solution, which is far from optimal. We can improve it to O(k*n*d), and if we assume a uniform distribution of items (i.e. each item repeats on average O(n*d/k) times), we might be able to improve it to O(d^2 * n + k^2), which is much better, since most likely d << k.

This can be done by building an inverted index of your transactions, meaning: create a map from each item to the transactions containing it (creating the index is O(nd + k)).

For example, if you have the transactions

transaction1 = ('apple','grape')
transaction2 = ('apple','banana','mango')
transaction3 = ('grape','mango')

The inverted index will be:

'apple' -> [1,2]
'grape' -> [1,3]
'banana' -> [2]
'mango' -> [2,3]
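
A minimal sketch of the index construction in Python (illustrative names, and it uses 0-based transaction ids rather than the 1-based ids above):

from collections import defaultdict

def build_inverted_index(transactions):
    """Map each item to the ids of the transactions containing it: O(nd + k)."""
    index = defaultdict(list)
    for tid, items in enumerate(transactions):
        for item in set(items):  # set() guards against duplicate items within one transaction
            index[item].append(tid)
    return index

transactions = [('apple', 'grape'),
                ('apple', 'banana', 'mango'),
                ('grape', 'mango')]
index = build_inverted_index(transactions)
# index['apple'] == [0, 1], index['grape'] == [0, 2], ...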

So, after understanding what an inverted index is, here are the guidelines for the solution:

  1. Build an inverted index for your data
  2. For each item x, iterate over all transactions it appears in, and build a histogram of all the pairs (x,y) such that y co-occurs with x.
  3. When you are done, you have a histogram containing up to k^2 entries, which you need to process (a sketch of steps 2 and 3 follows this list). This question discusses how to get the top-k elements out of an unsorted list.
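
One way steps 2 and 3 might look in Python (a sketch, not the only option; it builds on the hypothetical build_inverted_index above and uses a heap for the top-n):

import heapq
from collections import Counter

def top_pairs(transactions, index, n):
    """Steps 2-3: histogram of co-occurring pairs, then the n most frequent ones."""
    histogram = Counter()
    for x, tids in index.items():
        for tid in tids:                  # every transaction containing x
            for y in set(transactions[tid]):
                if y > x:                 # count each unordered pair exactly once
                    histogram[(x, y)] += 1
    return heapq.nlargest(n, histogram.items(), key=lambda kv: kv[1])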

Complexity analysis:

  1. Building an inverted index is O(nd+k)
  2. Assuming each item repeats in O(nd/k) transactions, each iteration takes O(nd/k * d) time, and you have k iterations in this step, so you get O(nd^2 + k) for this step.
  3. Processing the list can be done in O(k^2 log k) if you want a full ordering, or in O(k^2) if you just want the top X elements.

In total, this gives an O(nd^2 + k^2) solution for the top-X elements, which is MUCH better than the naive approach, assuming d << k.

In addition, note that the bottleneck (step 2) can be efficiently parallelized and distributed among threads if needed.
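
For instance, a hedged sketch of parallelizing step 2 with worker processes rather than threads (CPython threads won't speed up CPU-bound counting; the names and the round-robin sharding are illustrative, and parallel_histogram should be called under an if __name__ == '__main__': guard):

from collections import Counter
from functools import reduce
from multiprocessing import Pool

def count_shard(args):
    """Partial pair histogram for one shard of items (runs in a worker process)."""
    items, index, transactions = args
    partial = Counter()
    for x in items:
        for tid in index[x]:
            for y in set(transactions[tid]):
                if y > x:
                    partial[(x, y)] += 1
    return partial

def parallel_histogram(transactions, index, workers=4):
    items = list(index)
    shards = [items[i::workers] for i in range(workers)]  # round-robin split of the items
    with Pool(workers) as pool:  # note: index/transactions are pickled into each worker
        partials = pool.map(count_shard,
                            [(shard, dict(index), transactions) for shard in shards])
    return reduce(lambda a, b: a + b, partials, Counter())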

amit
  • Thank you very much for the comprehensive answer! – RalleG Jan 04 '13 at 19:24
  • @RalleG: You are most welcome. I hope it helps. Note that building an inverted index helps a lot when you have many collections/documents, each containing far fewer items/terms/words, which is why it is often used by search engines (just an FYI addition). – amit Jan 04 '13 at 19:26

If the number of items ordered in one purchase is small (< 10), do this:
keep a map of maps (a dictionary of dictionaries): the key in the outer map is an item;
its value is an inner map whose key is a second item and whose value counts how many times that second item appeared in a purchase together with the first.

So go through every order and update the map for every pair. At the end, go through the map of maps and look for "big values" among the inner counts (see the sketch below).

Note: depending on the size and "distribution" of the input data, you might end up with not enough memory.
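
One possible rendering in Python (a sketch; `orders` and the cutoff of 10 printed pairs are placeholders):

from collections import defaultdict

pair_counts = defaultdict(lambda: defaultdict(int))  # map of maps

orders = [('apple', 'grape'), ('apple', 'banana', 'mango'), ('grape', 'mango')]
for order in orders:
    items = sorted(set(order))
    for i, first in enumerate(items):
        for second in items[i + 1:]:
            pair_counts[first][second] += 1  # the smaller item is always the outer key

# Look for "big values" among the inner counts:
flat = [(first, second, count)
        for first, inner in pair_counts.items()
        for second, count in inner.items()]
flat.sort(key=lambda t: t[2], reverse=True)
print(flat[:10])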

NoSenseEtAl