4

Recently I have been working with combinations of words to make "phrases" in different languages and I have noticed a few things that I could do with some more expert input on.

Defining some constants for this,

Depths (n) is on average 6-7

The length of the input set is ~160 unique words.

  1. Memory - Generating n permutations of 160 words wastes lots of space. I can abuse databases by writing it to disk, but then I take a hit in performance as I need to constantly wait for IO. The other trick is to generate the combinations on the fly like a generator object
  2. Time - If Im not wrong n choose k gets big fast something like this formula factorial(n) / (factorial(depth) * (factorial(n-depth))) this means that input sets get huge quickly.

My question is thus.

Considering I have an function f(x) that takes a combination and applies a calculation that has a cost, e.g.

func f(x) {
    if query_mysql("text search query").value > 15 {
        return true
    }
    return false 
}

How can I efficiently process and execute this function on a huge set of combinations?

Bonus question, can combinations be generated concurrently?

Update: I already know how to generate them conventionally, its more a case of making it efficient.

Jakob Bowyer
  • 33,878
  • 8
  • 76
  • 91
  • does the 'depth' remain constant during a computation. So for one run of the algorithm your output is all `depth=6` combinations of words from the 160 length input or all combinations of words in the range `[1,6]`? – Matti Lyra Feb 10 '14 at 09:42
  • @MattiLyra Ideally I want depth to go from [n..2] but let me focus on one for now, and removed line for clarity – Jakob Bowyer Feb 10 '14 at 09:44
  • Well that would give you one way of parallelising the combinations, run each of the `n=3, n=4 ... n=n` in parallel as they don't depend on each other. – Matti Lyra Feb 10 '14 at 09:46
  • @MattiLyra It might be wasted calculation if the results from one collection of combinations are always in the enxt – Jakob Bowyer Feb 10 '14 at 09:48
  • 2
    I have no idea who voted to close this question. This question is perfectly fine for SO. – amit Feb 10 '14 at 11:00

2 Answers2

1

One approach will be to first calculate how much parallelism you can get, based on the number of threads you've got. Let the number of threads be T, and split the work as follows:

  • sort the elements according to some total ordering.
  • Find the smallest number d such that Choose(n,d) >= T.
  • Find all combinations of 'depth' (exactly) d (typically much lower than to depth d, and computable on one core).
  • Now, spread the work to your T cores, each getting a set of 'prefixes' (each prefix c is a combination of size d), and for each case, find all the suffixes that their 'smallest' element is 'bigger' than max(c) according to the total ordering.

this approach can also be translated nicely to map-reduce paradigm.

map(words): //one mapper
   sort(words) //by some total ordering function
   generate all combiations of depth `d` exactly // NOT K!!!
   for each combination c produced:
       idx <- index in words of max(c) 
       emit(c,words[idx+1:end])
reduce(c1, words): //T reducers
   combinations <- generate all combinations of size k-d from words
   for each c2 in combinations:
      c <- concat(c1,c2)
      emit(c,f(c))
amit
  • 175,853
  • 27
  • 231
  • 333
0

Use one of the many known algorithms to generate combinations. Chase's Twiddle algorithm is one of the best known and perfectly suitable. It captures state in an array, so it can be restarted or seeded if wished.

See Algorithm to return all combinations of k elements from n for lots more.

You can progress through your list at your own pace, using minimal memory and no disk IO. Generating each combination will take a microscopic amount of time compared to the 1 sec or so of your computation.

This algorithm (and many others) are easily adapted for parallel execution if you have the necessary skills.

Community
  • 1
  • 1
david.pfx
  • 10,520
  • 3
  • 30
  • 63
  • This doesn't answer the question in the slightest. – Jakob Bowyer Feb 10 '14 at 10:06
  • 1
    Then you should ask better questions. If you want an answer that explains how to use map reduce or Hadoop then say so. – david.pfx Feb 10 '14 at 10:39
  • If you learn to read, I say, I already know how to generate combinations, I also run out of memory when storing state in an array. – Jakob Bowyer Feb 10 '14 at 10:41
  • If you check the algorithm, the required state is N+2 ints. Since you cannot run out of memory allocating 162 ints, I must assume you are using the wrong algorithm. With the right algorithm, the rest is easy. – david.pfx Feb 10 '14 at 13:19