0

For numpy how can I efficiently create

  1. an array/matrix representing a list of all combinations (k out of n) as lists of k indices. The shape would be (binomial(n, k), k).

  2. a sparse array/matrix representing this combinations as bitmasks of length n. (So expanding aboves indices to bitmask.) The shape would be (binomial(n, k), n).

I need to do this with large n (and maybe small k). So the algorithm should be

  1. time efficient (e.g. maybe allocate complete result space at once before filling it?)

  2. space efficient (e.g. sparse bitmasks)

Many Thanks for your help.

sascha
  • 32,238
  • 6
  • 68
  • 110
saluto
  • 77
  • 7
  • 1
    Have you tried something? `binomial(n, k)` is a factorial, so no matter how fast you try to do this it will take a lot of time and space if the numbers are somewhat big... – jdehesa Apr 16 '18 at 16:22

1 Answers1

1

Assuming the blowup is not that bad (as mentioned in the comment above), you might try this. It's pretty vectorized and should be fast (for cases which could be handled).

Edit: i somewhat assumed you are interested in an output based on scipy.sparse. Maybe you are not.

Code

import itertools
import numpy as np
import scipy.sparse as sp

def combs(a, r):
    """
    Return successive r-length combinations of elements in the array a.
    Should produce the same output as array(list(combinations(a, r))), but
    faster.
    """
    a = np.asarray(a)
    dt = np.dtype([('', a.dtype)]*r)
    b = np.fromiter(itertools.combinations(a, r), dt)
    b_ = b.view(a.dtype).reshape(-1, r)
    return b_

def sparse_combs(k, n):
    combs_ = combs(np.arange(n), k)
    n_bin = combs_.shape[0]

    spmat = sp.coo_matrix(( np.ones(n_bin*k),
                            (np.repeat(np.arange(n_bin), k),
                             combs_.ravel()) ),
                            shape=(n_bin, n))
    return spmat


print('dense')
print(combs(range(4), 3))
print('sparse (dense for print)')
print(sparse_combs(3, 4).todense())

Output

dense
[[0 1 2]
 [0 1 3]
 [0 2 3]
 [1 2 3]]
sparse (dense for print)
[[ 1.  1.  1.  0.]
 [ 1.  1.  0.  1.]
 [ 1.  0.  1.  1.]
 [ 0.  1.  1.  1.]]

The helper-function combs i took (probably) from this question (sometime in the past).

Small (unscientific) timing:

from time import perf_counter as pc
start = pc()
spmat = sparse_combs(5, 50)
time_used = pc() - start
print('secs: ', time_used)
print('nnzs: ', spmat.nnz)

#secs:  0.5770790778094155
#nnzs:  10593800

(3, 500)
#secs:  3.4843752405405497
#nnzs:  62125500
sascha
  • 32,238
  • 6
  • 68
  • 110
  • 1
    Thanks. Using `np.full(n_bin*k, True)` instead of `np.ones(n_bin*k)` gives a boolean result and is slightly faster. Apart from this I haven't found anything faster so far. – saluto Apr 16 '18 at 19:16
  • @saluto I think i would prefer ```np.ones(n_bin*k, dtype=bool)``` to the ```np.full()``` approach (feels more natural to me). – sascha Apr 16 '18 at 19:24
  • Yes, maybe it's more natural, but `np.full` is faster. – saluto Apr 16 '18 at 19:28
  • Interesting. Well, trust your measurements! – sascha Apr 16 '18 at 19:28
  • And for `k = 2` and `n >= 12` replacing `combs(...)` with `np.asarray(np.triu_indices(n, 1)).T` gives further performance improvement. – saluto Apr 16 '18 at 22:31