1

My task is to:

  1. generate a list of unique combinations where each combination has a certain length (var com_len) and contains values (ints) from the given list (var values),
  2. each combination is created by taking random values from the given list (randomness is very important!),
  3. each combination must have unique values inside, no value can be repeated inside the combination,
  4. values in the combination must be sorted,
  5. appearance of each value within the whole combinations' set is counted (var counter),
  6. each value must appear in the whole dataset as close to the given number of times as possible (var counter_expected). "as close as possible" means, count each appearing value as the script goes and if there are no more combinations left to create, just end the script.

For example, I need to generate a list of combinations where each combination has a length of 3, has unique sorted values inside, each value is from the range(256) and each value appears in all of the combinations generated so far as close to 100 times as possible.

My problem is, how efficiently detect there are no more unique combinations of the values that can be created to stop the loop.

The problem appears when the script is going to an end and there still are available values left and len(available_values) > com_len but it isn't possible to create any new unique combinations that haven't appeared yet.

The code created so far:

import numpy as np
import random

com_len = 3
length = 256
counter = np.zeros(length)
values = range(length)
exclude_values = []
counter_expected = 100
done = []

mask = np.ones(len(np.array(values)), np.bool)
mask[exclude_values] = False
available_values = set(values) - set(exclude_values)
available_values = list(available_values)

ii = 0
while True:
    """print progress"""
    ii = ii + 1
    if not ii % 1000: print('.', end='')

    #POSSIBLE CONDITION HERE
    ex = random.sample(available_values, k=com_len)
    ex.sort()
    if ex in done: continue
    done.append(ex)

    counter[ex] = counter[ex] + 1

    for l_ in ex:
        if counter[l_] == counter_expected:
            del available_values[available_values.index(l_)]

    if len(available_values) < com_len:break

    if all(counter[mask] == counter_expected): break
    #OR HERE

NOTE: The script usually ends successfully because either len(available_values) < com_len or all(counter[mask] == counter_expected) condition breaks the While loop. Try running the script several times and at some point, you'll observe the script going into an infinite loop as len(available_values) >= com_len but there are no more new unique conditions available to be created so the counter is not increasing.

I need an efficient condition to stop the script. Using itertools.combinations here is not an option because available_values list may be long, i.e. 10k elements, at the beginning.

The brute-force would be using itertools.combinations when len(available_values) reaches a certain level and checking if there are any combinations that haven't been created yet but it's an ugly solution.

There might be a better way which doesn't come up to me but may to you. I'll be grateful for your help.

  • This is vague. Why not give a specific complete example, with a parameter smaller than 256? – John Coleman Apr 30 '20 at 01:12
  • Why do you think it is vague? I described the problem as specific as possible :(. It's the whole problem with the working code presenting the problem which only appear from time to time. If the script ends successfully for you, that means that either if len(available_values) < com_len or if all(counter[mask] == counter_expected) breaks the script and ends it. Try running the script several times and you'll see that at some option it may run into an infinite loop when len(available_values) >= com_len but there are no more unique combinations that can be created. – Krzysztof Maliszewski Apr 30 '20 at 01:24
  • John, by description you mean body or the title? – Krzysztof Maliszewski Apr 30 '20 at 01:30
  • Sure, thank you for your patience. I edited the body, providing an example of the problem. I'm writing it here so you don't have to: I need to generate a list of combinations where each combination has a length of 3, has unique sorted values inside, each value is from the range of (0,255) and each value appears in all of the combinations generated as close to 100 times as possible. – Krzysztof Maliszewski Apr 30 '20 at 01:37
  • I hope that additional changes make the problem less vague. Is there anything else that you think I should explain? – Krzysztof Maliszewski Apr 30 '20 at 01:43
  • That is clearer, though there are probably multiple solutions. Seems like some sort of set covering problem where you are trying to cover a set with subsets of size 3 such that each point is in 100 of the subsets. – John Coleman Apr 30 '20 at 01:50
  • Needed to think about your answer, but... yes :). That's the problem but the "100" part since it may never be possible to achieve 100 and, therefore, the loop must end when there are no more subsets of size 3 to create and the latter is part is my main problem here. – Krzysztof Maliszewski Apr 30 '20 at 01:58
  • If it isn't possible to achieve but you want to get close, you need a measure of total closeness (e.g. least squares of the differences between observed and desired counts). It becomes some sort of combinatorial optimization problem, with the solution being perhaps NP-hard. It really strikes be as related to the constellation of ideas related to [set covering](https://en.wikipedia.org/wiki/Set_cover_problem) problems. – John Coleman Apr 30 '20 at 02:03
  • Added one more important thing which I forgot: values inside the combinations must be random but within the given list of values! Randomness is very important! – Krzysztof Maliszewski Apr 30 '20 at 02:04
  • Some sort of experimental design? – John Coleman Apr 30 '20 at 02:07
  • 1
    This seems relevant: https://en.wikipedia.org/wiki/Combinatorial_design – John Coleman Apr 30 '20 at 02:12
  • True but do you have a piece of code for, for example, t-wise balanced design? Otherwise, the sole theory is useless to me, unfortunately. Otherwise, I'll use brute-force and itertools.combinations for small sets of available_values which won't be time-consuming and will do the job but will be an ugly resolution. – Krzysztof Maliszewski Apr 30 '20 at 02:30
  • UPDATE: updated the code to a more efficient one but still not solving the problem. – Krzysztof Maliszewski Apr 30 '20 at 04:03

3 Answers3

1

UPDATED FINAL ANSWER:

I've come up with the following code that matches my needs. NOTES:

  • The function is not the best in many terms but it does its job very well!
  • The function has 3 modes of generation of data: generating a total number of combinations, generating combinations with a minimal number of times each value appear across all combinations, generating combinations with a "max" number of times each value appear across all combinations ("max" means "as close as possible to max value").
  • The function allows changing the length of combinations on the fly within a selected range or providing a specific number.
  • Depending on the params, the function can do a redundant number of iterations as shown by 'Total number of errors'. But...
  • It's FAST! It uses sets and tuples for great performance. The only problem could happen when itertools.combinations fires returning tons (millions) of combinations but, in my case, it has never happened so far.

The code:

import numpy as np
import random
import itertools
from decimal import Decimal

def get_random_combinations(values, exclude, length, mode, limit, min_ = 0, max_ = 0):
    done = set()

    try:
        """Creating counter"""
        counter = np.zeros(len(values), np.uint)

        """Create a mask for excluded values"""
        """https://stackoverflow.com/questions/25330959/how-to-select-inverse-of-indexes-of-a-numpy-array"""
        mask = np.ones(len(np.array(values)), np.bool)
        mask[exclude] = False

        """available values to create combinations"""
        values_a = set(values) - set(exclude)
        values_a = list(values_a)

        if length == 1: 
            if mode == 'total':
                """generate just data_number of examples"""
                for ii in range(limit):
                    comb = random.sample(values_a, 1)[0]
                    del values_a[values_a.index(comb)]
                    done.add(tuple([comb]))
            else:
                """generate one example for each comb"""
                for comb in values_a: done.add(tuple([comb]))
        else: 
            """total number of combinations"""
            if isinstance(length, str): rr = np.mean([min_, max_])
            else: rr = length
            nn = len(values_a)
            comb_max = int(Decimal(np.math.factorial(nn)) / Decimal(np.math.factorial(rr) * np.math.factorial(nn-rr)))
            err_limit = int(comb_max * 0.01)
            if err_limit > 10000: err_limit = 10000

            """initiate variables"""
            #should itertools be used to generate the rest of combinations
            gen_comb = False 
            #has all combinations generated by itertools ended
            comb_left_0 = False 
            #has the limit of errors been reached to generate itertools combinations
            err_limit_reached = False 
            #previous combination 
            ll_prev = 0 
            dd = 0 #done counter
            comb_left = set() #itertools  combinations
            err = 0 #errors counter

            """options variables for statistics"""
            err_t = 0 #total number of errors
            gen = 0 #total number of generations of itertools.combinations
            ii = 0 #total number of iterations

            print('GENERATING LIST OF COMBINATIONS')
            while True:
                """print progress"""
                ii = ii + 1
                if not dd % 1000: print('.', end='')

                """check if length of combs is random or not"""
                if isinstance(length, str): 
                    """change max_ length of combinations to 
                    \the length of available values"""
                    if len(values_a) < max_: max_ = len(values_a)
                    ll = random.randint(min_, max_)
                else: ll = length

                if ll != ll_prev: gen_comb = True

                """generate combinations only when err limit is reached or 
                the length of combinations has changed"""
                if err_limit_reached and gen_comb: 
                    gen = gen + 1
                    """after reaching the max number of consecutive errors, start generating combinations via itertools"""
                    """generation is done at this point to prevent generation for a very long list"""
                    """generation is also done when when length of a combination changes"""
                    comb_left = set(itertools.combinations(values_a, ll)) - done
                    """break if there are no elements left"""
                    if not len(comb_left): break

                """if combinations has already been generated, used them"""
                if comb_left:
                    """take random sample from the set"""
                    comb = random.sample(comb_left, 1)[0]
                    """remove it from the set"""
                    comb_left.remove(comb)
                    """check if it was the last combination to break the loop at the end"""
                    if not len(comb_left): comb_left_0 = True
                else: 
                    """generate random combination"""
                    comb = tuple(sorted(random.sample(values_a, ll)))

                """set previous length"""
                ll_prev = ll
                """reset gen_comb"""
                gen_comb = False

                """check if combination is new"""
                if comb not in done: found = True
                else:
                    """otherwise, iterate errors"""
                    err = err + 1
                    err_t = err_t + 1
                    found = False
                    if err > err_limit: err_limit_reached = gen_comb = True

                if found:
                    """reset err"""
                    err = 0
                    dd = dd + 1
                    """add combination to done"""    
                    done.add(comb)

                    """increase counter for the combs"""
                    counter[list(comb)] = counter[list(comb)] + 1

                    """check if seekeing the max number of combinations or min"""
                    if mode == 'max':
                        """for max, we must remove the elements which reached the limit"""
                        for l_ in list(comb):
                            if counter[l_] == limit:
                                del values_a[values_a.index(l_)]

                """if length of available elements is smaller than the possible length of the combinations"""
                if isinstance(length, str):
                    """for random length, choose the minimal length"""
                    if len(values_a) < min_: break
                else: 
                    if len(values_a) < ll: break
                """if all elements reached the limit"""
                if mode == 'total':
                    if len(done) >= limit: break
                else: #min, max
                    if all(counter[mask] >= limit): break
                """if the number of consecutive errors reached 
                the total number of combinations, break as you may never 
                draw a valid combination"""
                if err > comb_max: break
                """if it was the last combination left"""
                if comb_left_0: break
    except Exception as e: print(e)
    print('')

    print('Combinations generated: ' + str(dd))        
    print('Total number of iterations: ' + str(ii))
    print('Final value of err: ' + str(err))
    print('Total number of errors: ' + str(err_t))
    print('How many times itertools.combinations was used: ' + str(gen))

    return done

"""range of values to create combinations"""
values = range(256)
"""values excluded from the combinations"""
exclude = [0,255]
"""length of combinations, if is string, the number of layers 
is generated randomly withing (min_, max_) range """
length = 'r'
"""mode of how the combinations are generated:
min: minimal number of times the value appears across all combinations 
(limited down by the limit value)
max: max number of times the value appears across all combinations (limited         
max by the limit value)
total: total number of combinations  (limited the limit value)"""
mode = 'max' 
"""limit used for the mode combinations' generation"""
limit = 1000
"""min_ and max_ are used when length is string, 
length is generated randomly within (min_, max_) range"""
min_ = 4
max_ = 12
done = get_random_combinations(values, exclude, length, mode, limit, min_, max_)
0

There are a total of n choose k possible combinations meeting your criteria, with n = length and k = com_len. n choose k evaluates to n! / (k! * (n - k)!). If you generate all distinct possibilities, each of the n values appears (n - 1)! / ((k - 1)! * (n - k)!) times (https://math.stackexchange.com/q/26619/295281). You should be able to solve this, assuming that z <= (n - 1)! / ((k - 1)! * (n - k)!), where z = counter_expected.

For your example:

  • n = 256
  • k = 3
  • z = 100 <= 32385

One common method to generate combinations in general is to step k bits through an boolean array of length n, always incrementing the lowest possible bit. Whenever a higher bit gets incremented, all the ones below it get reset to their initial position. Here is a sample sequence:

0 0 0 0 3 2 1
0 0 0 3 0 2 1
0 0 0 3 2 0 1
0 0 0 3 2 1 0
0 0 3 0 0 2 1
...
3 2 0 0 1 0 0
3 2 0 1 0 0 0
3 2 1 0 0 0 0

I've labeled the positions so that you can see that if the values are sorted to begin with, the combinations will always come out sorted. Keep in mind that you can implement this as an array of n booleans or k indices. Both have advantages and disadvantages.

For your particular use-case, there's a twist. You don't use a bit once is count has exceeded a certain amount. There are a number of ways of stepping through the bits, but they all boil down to having a size n counter array.

If n * z is a multiple of k, you will automatically be able to get exact counts in all of the bins. Neither n nor z themselves actually has to be multiples of k. If that's not true, however, you will inevitably have underflow or overflow. Intuitively, you want to generate a target of n * z total values, k at a time. It's pretty clear that one has to be a multiple of the later to make this possible.

You can have two types of exit criteria. Given the total accumulated count of all the bits, s,

  1. s >= n * z: all bits have a count of at least z. At most k - 1 bits have a count of z + 1.
  2. s > n * z - k: all bits have a count of z, except at most k - 1 bits, so adding one more combination would cause condition 1.

One final design choice to discuss is the order in which bits move. Since generating a series of combinations exhausts a bin, I'd like to be able to keep the exhausted bins accumulating sequentially, in a predictable order on one side of the array bucket. This will remove a lot of checks from the algorithm. So instead of incrementing the lowest possible bit, I will increment the highest possible bit, and increment the one below it whenever it resets. In that case, the exhausted buckets will always be the lowest bits.

So let's finally stop making unproven mathy-sounding statements and show an implementation:

def generate_combos(n, k, z):
    full_locs = np.arange(k + 1, dtype=np.uint)
    full_locs[k] = n                      # makes partial vectorization easier
    locs = full_locs[:k]                  # bit indices
    counts = np.zeros(n, dtype=np.uint)   # counter buckets
    values = np.arange(n, dtype=np.uint)  # values
    min_index = 0                         # index of lowest non-exhausted bin

    for _ in range((n * z) // k):
        counts[locs] += 1
        yield values[locs]

        if counts[min_index] == z:
            # if lowest bin filled, shift and reset
            min_index += np.argmax(counts[min_index:] < z)
            locs[:] = min_index + np.arange(k)
        else:
            # otherwise, increment highest available counter
            i = np.flatnonzero(np.diff(full_locs) > 1)
            if i.size:
                i = i[-1]
                locs[i] += 1
                # reset the remainder
                locs[i + 1:] = locs[i] + np.arange(1, k - i)
            else:
                break

This uses condition 2. If you want condition 1, add the following lines after:

if counters[-1] < z:
    yield values[-k:]

Changing the loop to something like for _ in range(-((n * z) // -k)): (courtesy of https://stackoverflow.com/a/54585138/2988730) won't help because the counters aren't designed to handle it.

Here is an IDEOne link showing the first hundred elements of generate_combos(256, 3, 10):

[0 1 2]
[0 1 3]
[0 1 4]
[0 1 5]
[0 1 6]
[0 1 7]
[0 1 8]
[0 1 9]
[ 0  1 10]
[ 0  1 11]
[2 3 4]
[2 3 5]
[2 3 6]
[2 3 7]
[2 3 8]
[2 3 9]
[ 2  3 10]
[ 2  3 11]
[ 2  3 12]
[4 5 6]
[4 5 7]
[4 5 8]
[4 5 9]
[ 4  5 10]
[ 4  5 11]
[ 4  5 12]
[ 4  5 13]
[6 7 8]
[6 7 9]
[ 6  7 10]
[ 6  7 11]
[ 6  7 12]
[ 6  7 13]
[ 6  7 14]
[ 8  9 10]
[ 8  9 11]
[ 8  9 12]
[ 8  9 13]
[ 8  9 14]
[ 8  9 15]
[10 11 12]
[10 11 13]
[10 11 14]
[10 11 15]
[10 11 16]
[12 13 14]
[12 13 15]
[12 13 16]
[12 13 17]
[12 13 18]
[13 14 15]
[14 15 16]
[14 15 17]
[14 15 18]
[14 15 19]
[14 15 20]
[15 16 17]
[16 17 18]
[16 17 19]
[16 17 20]
[16 17 21]
[16 17 22]
[16 17 23]
[17 18 19]
[18 19 20]
[18 19 21]
[18 19 22]
[18 19 23]
[18 19 24]
[18 19 25]
[19 20 21]
[20 21 22]
[20 21 23]
[20 21 24]
[20 21 25]
[20 21 26]
[20 21 27]
[21 22 23]
[22 23 24]
[22 23 25]
[22 23 26]
[22 23 27]
[22 23 28]
[22 23 29]
[24 25 26]
[24 25 27]
[24 25 28]
[24 25 29]
[24 25 30]
[24 25 31]
[24 25 32]
[26 27 28]
[26 27 29]
[26 27 30]
[26 27 31]
[26 27 32]
[26 27 33]
[26 27 34]
[28 29 30]
[28 29 31]
...

Notice that after the first 10 elements, both 0 and 1 have appeared 10 times. 2 and 3 appeared once, so they get used up after only 9 more iterations, and so forth.

Mad Physicist
  • 107,652
  • 25
  • 181
  • 264
  • Thank you, Mad Physicist. I appreciate your time and explanation. That's an interesting approach to this matter. Unfortunately, it doesn't solve the problem because of two issues. 1. your example fails because counts[locs] += 1 goes out of bounce, 2. it lacks the crucial part: generation of combinations must not follow any particular order, the values in combinations must be random because of the distribution of the combinations. I could generate all combinations with your code or itertools.combinations and then shuffle them but with n = 10000, k = 3, I wouldn't dare to calculate it! – Krzysztof Maliszewski Apr 30 '20 at 07:25
  • PS. Omit the part about the distribution of the combinations :). I shouldn't talk Math... – Krzysztof Maliszewski Apr 30 '20 at 07:32
  • @KrzysztofMaliszewski. It's about a third of a million if you want to visit each number 100 times. `(n * z) // k` – Mad Physicist Apr 30 '20 at 15:56
  • I'll fix the overrun problem shortly – Mad Physicist Apr 30 '20 at 15:57
  • Not sure what to do about the randomization. I'm sure it's feasible though – Mad Physicist Apr 30 '20 at 15:57
  • @KrzysztofMaliszewski. That being said, I am pretty sure I've come up with a reasonable way to randomize this thing using some bit twiddling and possibly a hash table. I'll post something if I can work out the details. This is an interesting problem. – Mad Physicist May 01 '20 at 02:18
  • I updated my answer with the final version of the code. – Krzysztof Maliszewski May 13 '20 at 04:40
0

Here is another answer focusing on the random aspect.

You have n choose k possibilities that you want to randomly sample to get approximately z occurrences of each value (using my notation from the other answer). I posit that if you take (n * z) // k k-sized samples, and your random number generator is actually uniform, you will automatically get approximately z occurrences of each element. In your example, with n=256, k=3 and z=100, it's plausible that among 8533, the distribution will indeed be fairly uniform among the 256 bins.

If you are willing to accept some level of imperfection in the uniformity, python's random.sample is a good choice. The population is all integers from zero to n choose k.

n choose k in this case is 256 * 255 * 254 / 6 = 2763520. This is just out of range for a signed 32-bit integer, but fits comfortably into an unsigned one. Better yet, you can simply use Python's infinite precision integers.

The trick is to map these numbers to a unique combination of values. This is done with a combinatorial number system, as described here.

from random import sample
from scipy.misc import combs

def gen_samples(n, k, z):
    codes = sample(range(combs(n, k)), (n * z) // k)
    return [unrank(n, k, code) for code in codes]

def unrank(n, k, i):
    """
    Implementation of Lehmer's greedy algorithm left as
    exercise for the reader
    """
    # return k-element sequence

See here for hints on unranking.

Mad Physicist
  • 107,652
  • 25
  • 181
  • 264