3

My problem is this: I want to find all n length combinations of m possible digits such that the average of the digits is greater than a threshold value X.

For example, say length n=3 and digits are {1, 2} and threshold is 1.5. Total allowable combinations are 2*2*2 == 2**3 = 8 viz

222 - avg 2.000000 > 1.500000 -> include in acceptable set
221 - avg 1.666667 > 1.500000 -> include in acceptable set
212 - avg 1.666667 > 1.500000 -> include in acceptable set
211 - avg 1.333333 < 1.500000 -> adding 1 and below to the exclude list
122 - avg 1.666667 > 1.500000 -> include in acceptable set
121 - avg 1.333333 < 1.500000 -> skipping this vote combo
112 - avg 1.333333 < 1.500000 -> skipping this vote combo
111 - avg 1.000000 < 1.500000 -> skipping this vote combo

final list of valid  votecombos
[[2, 2, 2], [2, 2, 1], [2, 1, 2], [1,2,2]]

The way, I think, to solve this problem is to imagine a tree of all possible combinations and then dynamically prune the tree for impossible solutions. For example imagine n=3 level tree like this

                            root
                         /        \
                        1          2
                    /      \    /     \
                   1       2    1     2
                 /  \    /  \  /  \  /  \
                1   2   1   2  1  2  1   2

Each path to a leaf is a possible combination. As you can imagine for levels n=3 and m=5, the number of nodes is N == m**n == 5**3 == 125' nodes. Its easy to see that the tree gets really really large even form=5andn=20` approx. 96 trillion nodes. So the tree cannot be stored in memory. But it also doesnt have to be since it is very structured.

The way to get all possible valid combinations is by DFS traversing the tree in a kind of pre-order way but also keep on pruning the tree at the same time as you traverse. For example, in the above example, the first three combinations {222, 221, 212} are valid however 211 is not valid. It also means that any other combination containing two 1's from that point on is not going to be valid. So we can pretty much prune the whole left part of the tree with root 1 except 122 ! This would have helped us avoid checking the 3 combinations.

To do this I wrote a simple python script

import string
import itertools
import numpy as np
import re

chars = '21'
grad_thr = 1.5
seats = 3

excllist = []
validlist = []

for word in itertools.product(chars, repeat = seats):
    # form the string of digits
    votestr = ''.join(word)
    print (votestr)

    # convert string into list of chars
    liststr = list(votestr)
    #print liststr

    # map list of chars to list of ints
    listint = map(int, liststr)

    if len(list(set(listint) & set(excllist))) == 0:
        # if there are no excluded votes in this votecombo then proceed

        # compute a function over the digits; func can be average/bayesian score/something else.
        y_mean = np.mean(listint)
        print 'avg %f' %y_mean
        #y_bayes = bayesian score

        if y_mean >= grad_thr:
            # if function result is greater than grad threshold then save result to a list of valid votes
            validlist.append(listint)
            print 'geq than %f -> include in acceptable set' %grad_thr

        elif y_mean < grad_thr:
            # if function result is not greater than grad threshold then add logic to stop searching the tree further
            # prune unnecessary part of the tree

            if listint[-1] not in excllist:
                excllist = [int(d) for d in range(listint[-1] + 1)]
                print 'adding %d and below to the exclude list' %listint[-1]
            else:
                print '%d already present in exclude list' %listint[-1]

    else:
        print 'skipping this vote combo'

print '\nfinal valid list of votecombos'
print validvotelist
print 'exclude list'
print excllist
print '\n'

In this way I go through every possible combo and skip to aviod computing the average. However I still end up checking each possible combo after I have entered the for-loop.

Is it possible to not check the combo at all? i.e. we know combo 121 does not work, however we still have to enter the for loop and then skip the combo. Is it possible to not do that?

Manas
  • 399
  • 1
  • 4
  • 13
  • Why is `n=4` in your first example? It looks like you only take strings of length `3`. And why is the average of `122` not `1.667`? – A.E Jun 25 '14 at 22:13
  • Sorry for that, edited n=3 – Manas Jun 25 '14 at 22:21
  • I still want to know how the average of the digits of `122` is `1.33`, and moreover how the average of the digits of `111` is also `1.33`... – A.E Jun 25 '14 at 22:23
  • corrected the mistake hopefully – Manas Jun 25 '14 at 22:24
  • Can there be "gaps" in the set of digits? E.g. could the set of digits be {4, 6}? – j_random_hacker Jun 25 '14 at 22:42
  • Not in my case. However, I dont see how the problem changes even if there are gaps. e.g. then the tree would be [root,[46],[46,46],[46,46,46,46]...] and so on. – Manas Jun 25 '14 at 22:53
  • 1
    I have in mind a different algorithm that starts with a multiset of n copies of the highest digit at the root of the DFS tree, and decreases some digit in each DFS edge. This leads to efficient pruning. It will work with arbitrary digit sets, but it's simpler if there are no holes, since it means that a child will always have sum exactly 1 less than its parent. – j_random_hacker Jun 25 '14 at 23:05

2 Answers2

1

Some suggestions:

  1. Build multisets of numbers instead of ordered lists. The average of an ordered list doesn't depend on the order of the numbers in it, and each multiset of numbers corresponds to many ordered lists, so you can save a lot of memory by keeping just the multisets, and generating all corresponding ordered lists from them when you need to.
  2. Instead of starting from an empty multiset and building it up by adding a number in each DFS edge, start with a full multiset containing n copies of the highest digit, and in each DFS edge, decrease one of the digits by 1. (This assumes that there are no "gaps" in the set of available digits.) The advantage here is that we know that traversing a DFS edge downwards can only ever decrease the average, so if doing so produces an average below the threshold, we know we can prune the branch completely, as all deeper descendants must have even lower averages.
  3. You don't actually need a single division anywhere: all you need to do is multiply the threshold x by n to get a minimum digit sum at the very start, to which you can then compare sums of multisets. Furthermore, with the previous suggestion for how to generate children, the sum of a child is always exactly 1 less than the sum of its parent, so we don't even need a loop to calculate the sum -- it's a constant-time operation.

Avoiding duplicates

The rule for generating children in (2) above does pose one difficulty however: how do we make sure that we don't generate a child more than once? E.g. if we had one node in the tree that contained the multiset {5, 8} (which in this example is just a plain set), then this would generate the children {4, 8} and {5, 7}; but if we had another node in the tree somewhere for the set {4, 9}, then this would generate the children {3, 9} and {4, 8} -- so the child {4, 8} would get generated twice.

The way around this is to figure out a rule by which each child can "pick" a unique parent, and then arrange things so that parents only generate the children for which they would be the "picked" parent. E.g. we could say that a child should pick as its unique parent the parent that, among all the parents that could generate it, is lexicographically largest when its elements are listed in increasing order. (You could also pick the lexicographically smallest, but the largest turns out to be more computationally efficient.) For the example multiset {4, 8}, the two parents that could generate it are {5, 8} and {4, 9}; of these, {5, 8} is lexicographically larger, so we will pick it as the parent.

But during DFS we generate children from parents and not the other way round, so we still need to transform this rule for "picking a parent" into something that tells us, when we are at a node that is potentially the parent of some other node, whether it is in fact the "picked" parent for that child. To do this, consider all the potential parents of some child node v. First of all, how many are there? If v has r distinct digits that are less than the maximum digit value then there are r possible parents, each equal to v, but having 1 different digit larger by 1.

Suppose that the lowest digit in v is d, and that there are k >= 1 copies of it. Among the r potential parents of v, all of them will have k copies of d as well -- except for one parent, u, that will have k-1 copies, because in this parent it was the digit d+1 that had to decrease by 1 to d (thereby increasing the number of copies of d from k-1 to k) to produce v. Now if we write out the r potential parents of v with their digits in increasing order, notice that all of them except for u will begin with k copies of d, while u begins with k-1 (which is possibly 0) copies of d followed by (at least 1 copy of) d+1. Thus u is lexicographically larger than all the other r-1 potential parents of v.

This tells us the criteria for being a picked parent from the perspective of a potential parent node. Suppose the smallest digit in u is d. Then some node v has u as its picked parent if and only if v can be formed either by decreasing a d-digit in u to d-1, or by decreasing a (d+1)-digit in u to d. This translates into two simple and efficient rules for generating children:

Suppose we are at some node u, and want to generate all of its children for the DFS according to the above rule, so that every satisfying multiset is generated exactly once across the tree. As before, let d be the smallest digit in u. Then:

  • If d > smallest_digit, generate a child that is the same as u, except that one of the d-digits in u has been decreased to d-1. Example: {3, 3, 3, 4, 6, 6} should generate the child {2, 3, 3, 4, 6, 6}.
  • If u contains a digit d+1, generate a child that is the same as u, except that one of the (d+1)-digits has been decreased to d. Example: {3, 3, 3, 4, 6, 6} should generate the child {3, 3, 3, 3, 6, 6}.

So in the above example, the node u = {3, 3, 3, 4, 6, 6} will generate 2 children. (No node will ever generate more than 2 children.)

If multisets are represented either as sorted lists or as frequency counts of digits in sorted order, then both conditions can be checked efficiently just by scanning the initial segment.

Example

On your example (remember that we only generate multisets here; generate every permutation of them in a separate step to find all ordered lists):

sum_threshold = 1.5*3 = 4.5

                                       SUM
                        222             6
                      /
                    122                 5
                  /
                112                     4
               PRUNE

On a slightly larger example, with digits = {1, 2, 3}, n = 3 and x = 0 (to show that all multisets will be generated, exactly once):

                                       SUM
                       333              9
                     /
                   233                  8
                 /     \
               133     223              7
                      /  \
                    123  222            6
                    /      \
                  113      122          5
                             \
                             112        4
                               \
                               111      3
j_random_hacker
  • 50,331
  • 10
  • 105
  • 169
  • I like this approach. In a sense, my solution generates the leaves of the tree constructed by your solution. From there you can build up all other solutions, but it ends up being awfully redundant if you are not clever. Your solution gets rid of this redundancy and is much simpler. – A.E Jun 26 '14 at 18:14
0

It seems like you are over complicating this. What I would do is take your digit set A, your string length n, and your threshold T and then solve the following optimization problem:

Minimize the sum of n elements of A (with repeats) such that the sum still exceeds
the threshold value T.

You can use the argmin of the result to then produce a multiset from which you can pull from without replacement to get any valid string. For example, any string with two 2's will have an average digit value exceeding your threshold, so any ordering of any three element subset of the multiset M = [1, 2, 2, 2] will be valid.

EDIT: Here is a way you could generate the minimum valid multisets. The partitionfunc definition was borrows from this SO post, then I just filter out those lists whose elements are all in digit_set. The reason min_sum gets the ceiling operation is because I am assuming the digits must be integers, thus their sum will be an integer. Thus in order to exceed the threshold, the value of the digit sum must be no less than ceil(num_digits * threshold). Hope this helps!

from math import ceil

def partitionfunc(n,k,l=1):
'''n is the integer to partition, 
   k is the length of partitions, 
   l is the min partition element size'''
if k < 1:
    raise StopIteration
if k == 1:
    if n >= l:
        yield (n,)
    raise StopIteration
for i in range(l,n+1):
    for result in partitionfunc(n-i,k-1,i):
        yield (i,)+result

def find_min_sets(num_digits, digit_set, threshold):
  min_sum = ceil(num_digits * threshold)
  min_sets = [l for l in partitionfunc(min_sum, num_digits) if
              all(map(lambda x: x in digit_set, l))]
  return min_sets
Community
  • 1
  • 1
A.E
  • 260
  • 2
  • 8
  • The idea of multisets is good, but there can be many different such minimal multisets: e.g. if digits = {1, 2, 3}, n = 2 and x = 1.9, then {3, 1} and {2, 2} are both possible answers from argmin. As n and/or the number of digits increases, the number of such minimal multisets grows quickly. – j_random_hacker Jun 25 '14 at 22:27
  • Good point. It would be better to find the minimum possible value `V` of the digit sum, and figure out all `n`-partitions of `V`, which sounds like a tractable dynamic programming problem (reminds me of "how many ways to make `N` dollars with the set `S` of currency denominations..) – A.E Jun 25 '14 at 22:34
  • 2
    With a couple of tweaks, this will be a fast and complete solution. First, you should call `partitionfunc` not only with `k=min_sum`, but in a loop from `min_sum` to `max_sum`, where `max_sum` is `num_digits * max_digit`: all of these partitions will be valid multisets, and distinct from each other :) – j_random_hacker Jun 26 '14 at 19:30
  • 1
    Second, to minimise the *delay* between results yielded by `partitionfunc`, you should bail out as early as possible with `if n > k * max_digit || n < k * l: raise StopIteration` (my Python syntax may be wrong). I *think* this will guarantee O(m) time (which is optimal) between results, provided that the set of allowed digits is a contiguous range starting at 1 and ending at some digit `max_digit`. (Without this "early" checking, `partitionfunc` might take time exponential in m to find the next valid partition: ... – j_random_hacker Jun 26 '14 at 19:37
  • ... e.g. suppose n=10, k=5 and digits = {1, 2}, then it will try 11111, 11112, 11121, 11122, 11211, 11212, 11221, 11222, ..., before finally hitting the valid multiset 22222.) – j_random_hacker Jun 26 '14 at 19:37