1

I am looking for a fast algorithm to solve the following problem in Python:

I have a set of boolean vectors of length M

B[1],B[2]...B[N] where length(B[i]) == M and B[i] in [0,1] for all i

I want to find all subsets of size K <= N that are almost mutually exclusive. That is, I want to enumerate the indices of vectors subset_idx such that:

  • len(subset_idx) <= K (subsets of at most K vectors)
  • sum(B[subset_idx), axis = 0) <= 1 at least M - T times (mutually exclusive)
  • sum(B[subset_idx), axis = 0) > 1 at most T times (not mutually exclusive)

Here:

  • T is a ``fudge factor" for the mutual exclusivity requirement. If T = 0, then the vectors in subset_idx must be mutually exclusive.

  • By mutually exclusive, I mean that that B[i,k] && B[j,k] = 0 for all vectors i, j in the subset_idx for almost all cases (i.e. M - T cases).

As an example, I included a sloppy implementation that uses numpy and itertools below. I think that both the algorithm and the implementation can be improved. I'm looking for a better algorithm, but any implementation-related pointers would be appreciated.

import numpy as np
import itertools

K = 3       #max size
T = 0       #fudge factor for mutual exclusivity requirement

B = np.array(
    [[1, 1, 0, 1, 0],
     [0, 0, 1, 1, 0],
     [1, 1, 0, 0, 1],
     [1, 1, 1, 1, 1],
     [0, 0, 0, 0, 0]]
)

N, M = B.shape

# if you want to generate vectors randomly, use:
#
#  np.random.seed(seed=1)
# N = 10      #number of vectors
# M = 20     #length of each vectors
# B = np.random.choice([0, 1], size = (N, M))


mutually_exclusive_subsets = []

#identify subsets that are almost mutually exclusive via exhaustive search
for subset_size in range(2, K):
    for subset_idx in itertools.combinations(range(N), subset_size):
        n_exclusive = np.sum(np.sum(B[subset_idx,], axis=0) <= 1)
        if n_exclusive >= (M - T):
            mutually_exclusive_subsets.append(subset_idx)

print mutually_exclusive_subsets
# >>> [(0, 4), (1, 2), (1, 4), (2, 4), (3, 4)]


#test that each subset identified is *almost* mutually exclusive
for exclusive_idx in mutually_exclusive_subsets:
    assert(np.sum(np.sum(B[exclusive_idx,], axis = 0) > 1) <= T)
Berk U.
  • 7,018
  • 6
  • 44
  • 69
  • 1
    The Hamming distance determines the number of different elements in (binary) strings and has implementations in numpy/scipy. Possible relevance: http://stackoverflow.com/questions/32730202 – bogus Mar 29 '17 at 15:22

0 Answers0