I am looking for a fast algorithm to solve the following problem in Python:
I have a set of boolean vectors of length M
B[1],B[2]...B[N]
where length(B[i]) == M
and B[i] in [0,1]
for all i
I want to find all subsets of size K <= N
that are almost mutually exclusive. That is, I want to enumerate the indices of vectors subset_idx
such that:
len(subset_idx) <= K
(subsets of at mostK
vectors)sum(B[subset_idx), axis = 0) <= 1
at leastM - T
times (mutually exclusive)sum(B[subset_idx), axis = 0) > 1
at mostT
times (not mutually exclusive)
Here:
T
is a ``fudge factor" for the mutual exclusivity requirement. IfT = 0
, then the vectors insubset_idx
must be mutually exclusive.By mutually exclusive, I mean that that
B[i,k] && B[j,k] = 0
for all vectorsi, j
in thesubset_idx
for almost all cases (i.e.M - T
cases).
As an example, I included a sloppy implementation that uses numpy
and itertools
below. I think that both the algorithm and the implementation can be improved. I'm looking for a better algorithm, but any implementation-related pointers would be appreciated.
import numpy as np
import itertools
K = 3 #max size
T = 0 #fudge factor for mutual exclusivity requirement
B = np.array(
[[1, 1, 0, 1, 0],
[0, 0, 1, 1, 0],
[1, 1, 0, 0, 1],
[1, 1, 1, 1, 1],
[0, 0, 0, 0, 0]]
)
N, M = B.shape
# if you want to generate vectors randomly, use:
#
# np.random.seed(seed=1)
# N = 10 #number of vectors
# M = 20 #length of each vectors
# B = np.random.choice([0, 1], size = (N, M))
mutually_exclusive_subsets = []
#identify subsets that are almost mutually exclusive via exhaustive search
for subset_size in range(2, K):
for subset_idx in itertools.combinations(range(N), subset_size):
n_exclusive = np.sum(np.sum(B[subset_idx,], axis=0) <= 1)
if n_exclusive >= (M - T):
mutually_exclusive_subsets.append(subset_idx)
print mutually_exclusive_subsets
# >>> [(0, 4), (1, 2), (1, 4), (2, 4), (3, 4)]
#test that each subset identified is *almost* mutually exclusive
for exclusive_idx in mutually_exclusive_subsets:
assert(np.sum(np.sum(B[exclusive_idx,], axis = 0) > 1) <= T)