
The problem

I am given N arrays of C booleans. I want to organize these into a data structure that allows me to do the following operation as fast as possible: given a new array, return true if this array is a "superset" of any of the stored arrays. By superset I mean this: A is a superset of B if A[i] is true for every i where B[i] is true. If B[i] is false, then A[i] can be anything.
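In code, the relation is simply (a tiny illustrative helper, not part of any library):

def is_superset(A, B):
    # A[i] must be true wherever B[i] is true
    return all(a or not b for a, b in zip(A, B))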

Or, in terms of sets instead of arrays:

Store N sets (each with C possible elements) in a data structure so you can quickly look up whether a given set is a superset of any of the stored sets.

Building the data structure can take as long as needed, but the lookup should be as efficient as possible, and the data structure can't take too much space.

Some context

I think this is an interesting problem on its own, but for the thing I'm really trying to solve, you can assume the following:

  • N = 10000
  • C = 1000
  • The stored arrays are sparse
  • The looked up arrays are random (so not sparse)

What I've come up with so far

  1. For O(NC) lookup: Just iterate over all the arrays and test each one (a sketch follows after this list). This is simply too slow though.

  2. For O(C) lookup: I had a long description here, but as Amit pointed out in the comments, it was basically a BDD. While this has great lookup speed, it has an exponential number of nodes. With N and C so large, this takes too much space.
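For reference, a minimal sketch of the naive approach 1, packing each boolean array into a Python integer so that one subset test is a handful of word-parallel bitwise operations (the helper names are mine):

def pack(bits):
    # pack a list of C booleans into one arbitrary-precision integer mask
    mask = 0
    for i, b in enumerate(bits):
        if b:
            mask |= 1 << i
    return mask

def superset_of_any(query_mask, stored_masks):
    # B is a subset of A exactly when B & A == B
    return any(b & query_mask == b for b in stored_masks)

This is still O(NC) in the worst case, but each test only touches about C/64 machine words.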

I hope that in between these O(NC) and O(C) solutions there is maybe an O(log(N)*C) solution that doesn't require an exponential amount of space.

EDIT: A new idea I've come up with

  • For O(sqrt(N)C) lookup: Store the arrays as a prefix trie. When looking up an array A, go to the appropriate subtree if A[i]=0, but visit both subtrees if A[i]=1.

    My intuition tells me that this should make the (average) lookup complexity O(sqrt(N)C), if you assume the stored arrays are random. But: 1. they're not random, they're sparse; and 2. it's only intuition, I can't prove it. (A sketch of this lookup follows below.)
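A minimal sketch of this lookup, with trie nodes as plain dicts (my own representation, chosen only for brevity):

def insert(root, bits):
    # bits: a stored array as a list of 0/1; one trie level per index
    node = root
    for b in bits:
        node = node.setdefault(b, {})
    node["end"] = True  # a stored array ends here

def query(node, A, i=0):
    # True if some stored array in this subtree is covered by A[i:]
    if node.get("end"):
        return True
    if i == len(A):
        return False
    # a stored 0-bit is compatible with any query bit
    if 0 in node and query(node[0], A, i + 1):
        return True
    # a stored 1-bit additionally requires A[i] = 1
    if A[i] == 1 and 1 in node and query(node[1], A, i + 1):
        return True
    return False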

I will try out both this new idea and the BDD method, and see which of the two works out best.

But in the meantime, doesn't this problem occur more often? Doesn't it have a name? Hasn't there been previous research? It really feels like I'm reinventing the wheel here.

Migi
  • If the arrays are static, you could precompute is_subset(a,b) for every possible pair (a,b). That would require a 10K*10K bitmap, i.e. 100M bits (100M/CHAR_BIT bytes). Lookup would be O(1), obviously. – wildplasser Feb 19 '12 at 21:07
  • @wildplasser: The array that is looked up isn't one of the N stored sets (otherwise the result would always be true), so I don't see how I could use this bitmap for lookups. Also, none of the stored sets will be a subset of one another, because if A is a subset of B, I can just as well store only A. – Migi Feb 19 '12 at 21:55
  • Sorry, I read too fast. With some negated logic, and by extending the bitset beyond 64 bits, maybe this (which is very similar to a BDD) can be adapted: http://stackoverflow.com/questions/9246017/efficiently-find-the-first-element-matching-a-bit-mask/9295393#9295393 – wildplasser Feb 19 '12 at 22:09
  • Your approach is actually very similar to a [BDD](http://en.wikipedia.org/wiki/Binary_decision_diagram). At the very least, knowing such a thing already exists can save you implementation time. Note that BDDs are usually used for general formulas with up to a few hundred variables. – amit Feb 19 '12 at 22:23
  • Since the required outcome is supposed to be a boolean, this could indeed be implemented as a BDD. The variable ordering could probably be guided by putting the bits (in C) whose frequency (across the N arrays) is closest to 1/2 near the top. – wildplasser Feb 20 '12 at 22:59
  • What does the result of ORing all initial arrays look like? What's the ratio of 0's and 1's? – Philip Feb 21 '12 at 10:10
  • @Philip: ORing all initial arrays gives all 1s (or very close to it). ANDing all initial arrays gives all 0s. – Migi Feb 21 '12 at 11:57
  • And the ratio 0's to 1's is about 20:1 in the stored arrays, and about 2:1 in the queried arrays. – Migi Feb 21 '12 at 13:01
  • The ratios suggest that you should sort the stored arrays by number of ones: if almost all stored arrays have only C/21 ones and you typically need C/3 ones to have a match, then you can typically skip almost all stored arrays. It's a heuristic only, but from your description it sounds like a good one. – Erik P. Feb 21 '12 at 16:51

4 Answers


Just to add some background information to the prefix trie solution: I recently found the following paper:

I. Savnik: Index data structure for fast subset and superset queries. CD-ARES, IFIP LNCS, 2013.

The paper proposes the set-trie data structure, a trie-based container that provides efficient storage and querying of sets of sets, supporting operations like finding all supersets or subsets of a given set in a collection of sets.

For any Python users interested in an actual implementation, I came up with a Python 3 package based partly on the above paper. It contains a trie-based container of sets and also a mapping container where the keys are sets. You can find it on GitHub.
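For illustration, here is a minimal pure-Python sketch of the set-trie idea from the paper (elements are stored along trie paths in sorted order; the names are mine, not the paper's or the package's):

class SetTrieNode:
    def __init__(self):
        self.children = {}  # element -> SetTrieNode
        self.end = False    # True if a stored set ends at this node

def add(root, s):
    node = root
    for e in sorted(s):
        node = node.children.setdefault(e, SetTrieNode())
    node.end = True

def contains_subset_of(node, elems, i=0):
    # elems must be sorted; True if some stored set is a subset of elems[i:]
    if node.end:
        return True
    for j in range(i, len(elems)):
        child = node.children.get(elems[j])
        if child is not None and contains_subset_of(child, elems, j + 1):
            return True
    return False

A query A is then a superset of some stored set exactly when contains_subset_of(root, sorted(A)) returns True.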

Lrrr
mmihaltz

I think the prefix trie is a great start.

Since your arrays are sparse, I would additionally test them in bulk. If (B1 ∪ B2) ⊆ A, then both B1 ⊆ A and B2 ⊆ A. So the idea is to OR-pack the arrays in pairs, and to reiterate until there is only one "root" array (this takes only twice as much space in total). It lets you answer 'Yes' earlier, which is mainly useful if you don't need to know which array is actually contained. (A sketch follows below.)
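A minimal sketch of this OR-packing, with arrays represented as Python integer masks (my own encoding, chosen for brevity):

def build_or_tree(masks):
    # levels[0] holds the original arrays; each higher level ORs adjacent pairs
    levels = [list(masks)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        levels.append([prev[i] | (prev[i + 1] if i + 1 < len(prev) else 0)
                       for i in range(0, len(prev), 2)])
    return levels

def exists_subset(levels, A):
    def rec(level, idx):
        mask = levels[level][idx]
        if mask & A == mask:   # the whole group's union fits in A: early 'Yes'
            return True
        if level == 0:         # a single array, and it does not fit
            return False
        left, right = 2 * idx, 2 * idx + 1
        return rec(level - 1, left) or \
               (right < len(levels[level - 1]) and rec(level - 1, right))
    return rec(len(levels) - 1, 0)

Note that the union test can only short-circuit towards 'Yes'; when it fails, both halves still have to be examined.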

Independently, you can apply to each array a hash function that preserves ordering.

I.e.: B ⊆ A ⇒ h(B) ≼ h(A)

ORing the bits together is such a function, but you can also count the 1-bits in suitable partitions of the array. Here, you can eliminate candidates faster (answering 'No' for a particular array); a sketch follows below.
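A minimal sketch of the counting variant, again on integer masks (the partition count k and the names are arbitrary choices of mine):

def chunk_counts(mask, C, k):
    # popcount of each of k equal-width chunks of a C-bit mask
    w = (C + k - 1) // k
    full = (1 << w) - 1
    return [bin((mask >> (i * w)) & full).count("1") for i in range(k)]

def may_be_subset(sig_b, sig_a):
    # necessary condition: B ⊆ A implies B's count <= A's count in every chunk
    return all(b <= a for b, a in zip(sig_b, sig_a))

Precompute chunk_counts for every stored array; at query time, compute the query's signature once and run the exact B & A == B test only on candidates that pass may_be_subset.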

YvesgereY

You can simplify the problem by first reducing your list of sets to "minimal" sets: keep only those sets which are not supersets of any other ones. The problem remains the same because if some input set A is a superset of some set B you removed, then it is also a superset of at least one "minimal" subset C of B which was not removed. The advantage of doing this is that you tend to eliminate large sets, which makes the problem less expensive.
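A minimal sketch of this reduction, on integer masks (quadratic in N, but it only runs once when building the structure):

def minimal_sets(masks):
    # keep only masks that contain no other stored mask
    kept = []
    for m in sorted(set(masks), key=lambda m: bin(m).count("1")):
        # processing in order of increasing popcount guarantees that any
        # subset of m has already been seen
        if not any(k & m == k for k in kept):
            kept.append(m)
    return kept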

From there I would use some kind of ID3 or C4.5 algorithm.

mitchus

Building on the trie solution and the paper mentioned by @mmihaltz, it is also possible to find subsets using existing efficient trie implementations for Python. Below I use the package datrie. The only downside is that the keys must be converted to strings, which can be done with "".join(chr(i) for i in myset). This, however, limits the range of elements to the valid Unicode code points (roughly 1.1 million values).

from datrie import BaseTrie, BaseState

def existsSubset(trie, setarr, trieState=None):
    # setarr: the query set as a sorted string of characters (same encoding
    # as the stored keys); returns True if any stored key is a subset of it
    if trieState is None:
        trieState = BaseState(trie)

    # work on a copy so that the caller's state is not advanced
    trieState2 = BaseState(trie)
    trieState.copy_to(trieState2)
    for i, elem in enumerate(setarr):
        if trieState2.walk(elem):
            # either a stored set ends right here, or its remaining elements
            # may match among the remaining (larger) elements of the query;
            # the slice starts at i + 1 because elements within a key are
            # strictly increasing, so re-walking elem could never succeed
            if trieState2.is_terminal() or existsSubset(trie, setarr[i + 1:], trieState2):
                return True
            # backtrack: restore the entry state and try skipping elem instead
            trieState.copy_to(trieState2)
    return False

The trie can be used like a dictionary, but the range of possible elements has to be provided up front:

alphabet = "".join(chr(i) for i in range(100))
trie = BaseTrie(alphabet)

for subset in sets:
    # keys must be built from sorted elements, so that walking the trie
    # visits elements in increasing order (existsSubset above relies on this)
    trie["".join(chr(i) for i in sorted(subset))] = 0  # the assigned value does not matter

Note that the trie implementation above works only with elements larger than (and not equal to) 0; otherwise the integer-to-character mapping does not work properly. This problem can be solved with an index shift.
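A query then looks like this (the query set must be sorted and encoded the same way as the keys; the example values are mine):

query = {1, 2, 3, 6}
key = "".join(chr(i) for i in sorted(query))
print(existsSubset(trie, key))  # True iff some stored set is a subset of query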

A Cython implementation that also covers the conversion of elements can be found here.

Samufi