1

Problem Description:


I'm working on making a function which gives me a definition for a particular combination of several descriptors based on a single index. My inputs are a set of raw features X = [feat0,feat1,feat2,feat3,feat4], a list of powers to be used pow = [1,2,3], and a list of group sizes sizes = [1,3,5]. A valid output might look like the following:

  feat0^2 * feat4^3 * feat1^1

This output is valid because feat0, feat4, and feat1 exist within X, their powers exist within pow, and the number of features being combined is in sizes.

Invalid edge cases include:

  1. values which don't exist in X, powers not in pow, and combination sizes not in sizes
  2. combinations that are identical to another are invalid: feat0^2 * feat1^3 and feat1^3 * feat0^2 are the same
  3. combinations that include multiples of the same feature are invalid: feat0^1 * feat0^3 * feat2^2 is invalid

under the hood I'm encoding these groupings as lists of tuples. So feat0^2 * feat4^3 * feat1^1 would be represented as [(0,2), (4,3), (1,1)], where the first element in the tuple is the feature index, and the second is the power.

Question:


my question is, how can I create a 1 to 1 mapping of a particular combination to an index i? I would like to get the number of possible combinations, and be able to plug in an integer i to a function, and have that function generate a particular combination. Something like this:

X = [0.123, 0.111, 11, -5]
pow = [1,2,3]
sizes = [1,3]

#getting total number of combinations
numCombos = get_num_combos(X,pow,sizes)

#getting a random index corresponding to a grouping
i = random.randint(0, numCombos)

#getting grouping
grouping = generate_grouping(i, X, pow, sizes)

print(grouping)

Resulting in something like

[(0,1), (1,2), (3,1)]

So far, figuring out the generation when not accounting for the various edge cases wasn't too hard, but I'm at a loss for how to account for edge cases 2 and 3; making it guaranteed that no value of i is algebraically equivalent to any other value of i, and that the same feature does not appear multiple times in a grouping.

Current Progress


#computes the n choose k of a list and a size
def get_num_groupings(n, k):
    return int(math.factorial(n)/(math.factorial(k)*math.factorial(n-k)))

import numpy as np
import bisect

i = 150

n = 5
m = 3
sizes = [1, 3, 5]

#computing the number of elements in each group length
numElements = [m**k * get_num_groupings(n, k) for k in sizes]

#index bins for each group size
bins = list(np.cumsum(numElements))[:-1]

#getting the current group size
binIdx = bisect.bisect_left(bins,i)
curSize = sizes[binIdx]

#adding idx 0 to bins
bins = [0]+bins

#getting the location of i in the bin
z = i - bins[binIdx]

#getting the product index and combination rank
pi = z // m**k
ci = z % m**k

#getting the indexes of the powers
pidx = [(pi // m**(curSize - (num+1)))%m for num in range(curSize)]

#getting the indexes of the features
#TODO cidx = unrank(i, range(n))

This is based on the Mad Physicist's answer. Though I haven't figured out how to get cidx yet. Some of the variable names are rewritten for my own understanding. To my knowledge this implimentation works by logically separating the combinations of variables and which powers they each have. So far, I can get the powers from an index i, and once unrank is ironed out I should be able to get the indexes for which features are used.

Warlax56
  • 1,170
  • 5
  • 30

1 Answers1

2

Let's look at a slightly different problem that's closely related to what to want: generate all the possible valid combinations.

If you choose a size and a power, finding all possible combinations of features is fairly straightforward:

from itertools import combinations, product

n = len(X)
m = len(powers)
k = size = ...  # e.g. 3
pow = ...   # e.g. [1, 2, 3]

The iterator of unique combinations of features is given by

def elements(X, size, pow):
    for x in combinations(X, size):
        yield sum(e**p for p, e in zip(pow, x))

The equivalent one-liner would he

(sum(e**p for p, e in zip(pow, x)) for x in combinations(X, size))

This generator has exactly n choose k unique elements. These elements meet all your conditions by definition.

Now you can loop over all possible sizes and product of powers to get all the options:

def all_features(X, sizes, powers):
    for size in sizes:
        for pow in product(powers, repeat=size):
            for x in combinations(X, size):
                yield sum(e**p for p, e in zip(pow, x)) 

The total number of elements is the sum for each k of m**k * n choose k.

Now that you've counted the possibilities, you can compute the mapping of element to index and vice versa, using a combinatorial number system. Sample ranking and unranking functions for combinations are shown here. You can use them after you adjust the index for the size and power bins.

To show what I mean, assume you have three functions (given in the linked answer):

  1. choose(n, k) computes n choose k
  2. rank(combo) accepts the ordered indices of a specific commination and returns the rank.
  3. unrank(ind, k) accepts a rank and size, and returns the k indices of the corresponding combination.

You can then compute the offsets of each size group and the step for each power within that group. Let's work through your concrete example with n = 5, m = 3, and sizes = [1, 3, 5].

The number of elements for each size is given by

elements = [m**k * choose(n, k) for k in sizes]

The total number of possible arrangements is sum(elements):

3**1 * choose(5, 1) + 3**3 * choose(5, 3) + 3**5 * choose(5, 5) = 3 * 5 + 27 * 10 + 243 * 1 = 15 + 270 + 243 = 528

The cumulative sum is useful to convert between index and element:

cumsum = [0, 15, 285]

When you get an index, you can check which bin it falls in using bisect.

Let's say you were given index = 55. Since 15 < 55 < 285, your offset is 15, size = 3. Within the size = 3 group, you have an offset of z = 55 - 15 = 40.

Within the k = 3 group, there are m**k = 3**3 = 27 power products. The index of the product is pi = z // m**k and the combination rank is ci = z % m**k.

So the indices of the power are given by

pidx = [(pi // m**(k - 1)) % m, (pi // m**(k - 2)) % m, ...]

Similarly, the indices of the combination are given by

cidx = unrank(ci, k)

You can convert all these indices into a value using something like

sum(X[q]**powers[p] for p, q in zip(pidx, cidx))
Mad Physicist
  • 107,652
  • 25
  • 181
  • 264
  • By the way, I thought I should note, I've been trying to wrap my head around your response for what feels like hours, and every time I go to write a question you answer it before I get the chance. It must be the terminator eyes. Really though this answer is amazing – Warlax56 Oct 04 '20 at 05:17
  • 1
    @Warlax56. Sorry it was late at night where I'm at. The answer is a mess and I'll clean it up when I get the chance – Mad Physicist Oct 04 '20 at 13:02
  • I made some adjustments to my question. After looking through your response, I think it's critical that the implementation is "subscriptable". I'm working with an X of length 500, powers of length 3, and combo sizes of 1,2, and 3. Holding all results in memory is practically infeasible. therefore that idea of `grouping = generate_grouping(i, X, pow, sizes)` becomes critical – Warlax56 Oct 04 '20 at 17:33
  • * holding all results in memory is practically infeasible, and iterating through a generator to get to `i` seems slow – Warlax56 Oct 04 '20 at 17:40
  • @Warlax56. Using tuples as you do makes it just so much easier. I've shown you how to map an integer to the indices already. The only thing is that your size is a single integer, `ci` is shown as the indices into `X` and `pi` into `pow`. You don't need to list `size` explicitly because it's just the length of each index list. – Mad Physicist Oct 04 '20 at 18:00
  • Mad Physicist, I'm working through getting a preliminary implementation working. I checked the link, and I couldn't find a definition for the function `unrank(ind, k)`? Also, I'm doing it myself based on your description, but In working on this do you have a full implementation that you can share? – Warlax56 Oct 04 '20 at 21:08
  • 1
    @Warlax56. I'll get you something on Monday. The functions have different names in the linked question, but they are the same function. Play with them to determine which is which. – Mad Physicist Oct 04 '20 at 23:29
  • 1
    I [answered a previous question](https://stackoverflow.com/questions/18966230/index-into-size-ordered-power-set/18969320#18969320) that's sort of similar to the one you linked to. – Blckknght Oct 05 '20 at 00:12
  • Hey Mad Physicist. No pressure if you don't want to add to your answer, you've already been super helpful, but If you have the beginnings of a preliminary implementation would you mind sharing? – Warlax56 Oct 07 '20 at 05:21
  • 1
    @Warlax56. I'll do it when I can, but the linked question shows the three functions you need under slightly different names. I've already showed you the rest here. – Mad Physicist Oct 07 '20 at 11:18