
I have a big dictionary with frozenset keys. I need to find all keys that are subsets of a given set. I see an obvious way to do it:

dictionary = {
    frozenset([1]): 1,
    frozenset([2]): 2,
    frozenset([3]): 3,
    frozenset([3, 4]): 34
}
biglist= [3, 4, 5]
results = {k: v for k, v in dictionary.items() if k.issubset(biglist)}
assert results == {frozenset([3]): 3, frozenset([3, 4]): 34}

But it is very slow for millions of keys. The question is: is there any data structure for fast searches of this type?

UPD: Basically, I don't want to iterate over all keys executing issubset on each one. Instead, I can generate all possible subsets of biglist and check whether each one is in the dictionary:

import itertools

results = {}
maxkey = max(dictionary, key=len)      # longest key in the dictionary
maxlen = len(maxkey)
for length in range(1, maxlen + 1):
    for subset in itertools.combinations(biglist, length):
        key = frozenset(subset)
        if key in dictionary:
            results[key] = dictionary[key]

But this method is also very expensive for a long biglist.

demon.mhm
  • https://stackoverflow.com/questions/728972/finding-all-the-subsets-of-a-set - are you looking for a data structure in particular, or is an algorithm okay? Does that answer your question? – en_Knight Aug 01 '18 at 14:38
  • Depending on the size of your `biglist`, iterating the elements in the dict might actually be faster than checking all the subsets. Have you considered using a different data structure, e.g. something like a [Trie](https://en.wikipedia.org/wiki/Trie) instead? – tobias_k Aug 01 '18 at 14:40
  • Python is a higher-level language and is inherently a bit slow. Try using Cython or completely rewriting the code in C++. It might speed up the process; in some cases it can speed things up by up to 500 times. – BigZee Aug 01 '18 at 14:41
  • @ZararYounis That does not change the asymptotic complexity of the code. If generating all the subsets is exponential in Python, it's still so in C. – tobias_k Aug 01 '18 at 14:42
  • How long is your biglist? How do things improve if you make `biglist` into a set before using it in your dictcomp? – DSM Aug 01 '18 at 14:47
  • What characteristics do the keys have? Are there just a few items in each key? Maybe it would be good to sort all the keys, store the maximum value of each key, and compare `issubset()` only for valid keys. – Andrej Kesely Aug 01 '18 at 14:56

4 Answers

3

Depending on the size of your dictionary and the length of your keys, neither checking all the keys in the dict nor enumerating all the subsets and checking those is a good solution. Instead, you could restructure your "flat" dictionary into something like a trie, or prefix tree. Here, each element in the set points to another branch of the tree and/or an actual value:

dictionary = {
    frozenset([1]): 1,
    frozenset([2]): 2,
    frozenset([3]): 3,
    frozenset([3, 4]): 34
}

def totree(d):
    tree = {}
    for key in d:
        t = tree
        for x in sorted(key):          # one nesting level per (sorted) element
            t = t.setdefault(x, {})
        t["value"] = d[key]            # the actual value is stored at the final node
    return tree

tree = totree(dictionary)
# {1: {'value': 1}, 2: {'value': 2}, 3: {'value': 3, 4: {'value': 34}}}

Now you can recursively check this tree and yield each key that has a value. Unlike enumerating all the subsets, this only expands those branches where all the elements so far are in the tree.

def check_subsets(tree, key, prefix=[]):
    if "value" in tree:                # a stored key ends at this node
        yield prefix, tree["value"]
    for i, x in enumerate(key):
        if x in tree:                  # only descend into branches that actually exist
            yield from check_subsets(tree[x], key[i+1:], prefix+[x])

biglist= [3, 4, 5]
res = list(check_subsets(tree, sorted(biglist)))
# [([3], 3), ([3, 4], 34)]

Note that it's important that both the keys in the tree and the key for lookup are added/checked in sorted order, otherwise relevant subtrees could be missed.
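
For completeness, a small sketch (reusing tree, biglist and check_subsets from above) that maps the yielded (prefix, value) pairs back into the dict-of-frozensets format from the question:

results = {frozenset(prefix): value
           for prefix, value in check_subsets(tree, sorted(biglist))}
assert results == {frozenset([3]): 3, frozenset([3, 4]): 34}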

Addendum 1: This should be clear, but just to make sure: of course this approach will not help if you construct the tree anew for each lookup; in that case you could just as well do a linear scan of all the keys. Instead, you have to create the tree once, then reuse it for multiple lookups, and possibly update it as new keys are added.
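
As a rough sketch of such an update (reusing the nested-dict layout built by totree above; add_key is just an illustrative helper name), adding a single new key/value pair is essentially the inner loop of totree:

def add_key(tree, key, value):
    t = tree
    for x in sorted(key):          # same sorted order as in totree
        t = t.setdefault(x, {})
    t["value"] = value

add_key(tree, frozenset([3, 4, 5]), 345)   # example: extend the tree from above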

Addendum 2: Instead of "value", you can of course use any key for the actual value at that node in the prefix tree. You could use None, or a very long string, or a large random number guaranteed not to be an element in any of your key-sets. You could also, with a few adaptations to the totree and check_subsets functions, define a custom Tree class...

class Tree:
    def __init__(self, value=None, children=None):
        self.value = value
        self.children = children or {}
    def __repr__(self):
        return "Tree(%r, %r)" % (self.value, self.children)

... but IMHO just using nested dictionaries with some special value keys is simpler.
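
If you do go the class route, the adapted helpers could look roughly like this (just a sketch; totree_cls and check_subsets_cls are illustrative names, and it assumes None is never stored as an actual value):

def totree_cls(d):
    tree = Tree()
    for key in d:
        t = tree
        for x in sorted(key):
            t = t.children.setdefault(x, Tree())
        t.value = d[key]
    return tree

def check_subsets_cls(tree, key, prefix=[]):
    if tree.value is not None:         # assumes None is never a real value
        yield prefix, tree.value
    for i, x in enumerate(key):
        if x in tree.children:
            yield from check_subsets_cls(tree.children[x], key[i+1:], prefix+[x])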

tobias_k
  • I think a prefix-style approach is the right one here, but we can do a little better than a vanilla prefix tree. I think we want to build a "subset" tree where each node is the child of (one of) its longest subsets that is also in the tree. That way the tree is more compact; for example, "abc" and "acd" can both be children of "ac". – Bi Rico Aug 01 '18 at 17:30
  • Maybe, but then both the construction and the lookup in the tree would be more difficult. You can't just check "is the next element in the tree?" but "is the next, or maybe the next two, or the next n elements in the tree?" The tree would be more compact, but (unless I'm missing something here) it kind of defeats the prefix-nature of the tree. – tobias_k Aug 01 '18 at 18:33
  • I've tested this solution and it works like a charm. Fast index creation and fast searching. The only problem I've found was using "value" as a key, because it can potentially be a set member itself. – demon.mhm Aug 02 '18 at 12:03
  • @demon.mhm Glad it worked. About the values: you can use any (hashable) object as the "special" key, including e.g. `None`, or a string with a thousand characters. You could also create a separate `TreeNode` class with separate attributes for the value and the children, instead of using a dict for both. – tobias_k Aug 02 '18 at 12:11
  • @tobias_k Thanks for mentioning it. I wonder if there is any way to assign the last node the actual value instead of a dictionary, so one can check for a known type instead of "value" or another hashable. Can't figure out how to get it from setdefault. Maybe track the last node from the key and assign the value to it instead of doing setdefault on it? – demon.mhm Aug 02 '18 at 23:37
  • 1
    @demon.mhm Not sure what you mean. You can't just add the value as the value of the prefix-key, as a prefix-key can have both a value and a sub-tree, as in the case of `3`. – tobias_k Aug 03 '18 at 08:21
1

This answer is loosely based on the idea of prefix trees, but it comes at the problem from a slightly different tack. Basically, we want to figure out how we can avoid touching the entire search space we would cover by enumerating all subsets, by using some kind of early stopping.

If we arrange our data into a "SubsetTree" such that all the children of a node are supersets of that node, we can stop exploring the tree whenever we reach a node that isn't a subset of our current query, because we know all of its children will also not be subsets. When we build the tree, we want to prefer long parents over short parents, because that will increase the amount of early stopping in our search.

If you put all this together it looks something like this:

class SubsetTree:
    def __init__(self, key):
        self.key = key
        self.children = []

    def longestSubset(self, query):
        # Return the deepest node whose key is a subset of `query` (or None).
        if not self.key.issubset(query):
            return None
        more = (x.longestSubset(query) for x in self.children)
        more = filter(lambda i: i is not None, more)
        return max(more, key=lambda x: len(x.key), default=self)

    def allSubsets(self, query):
        # Yield every stored key that is a subset of `query`; stop descending
        # as soon as a node's key is not a subset, since none of its children can be.
        if not self.key.issubset(query):
            return
        if len(self.key) > 0:
            yield self.key
        for c in self.children:
            yield from c.allSubsets(query)


def buildSubtree(sets):
    # Insert keys from shortest to longest so each key hangs off its
    # longest already-inserted subset.
    sets = sorted(sets, key=lambda x: len(x))
    tree = SubsetTree(frozenset())
    for s in sets:
        node = SubsetTree(s)
        tree.longestSubset(s).children.append(node)
    return tree

dictionary = {
    frozenset([1]): 1,
    frozenset([2]): 2,
    frozenset([3]): 3,
    frozenset([3, 4]): 34
}
biglist= [3, 4, 5]

subsetTree = buildSubtree(dictionary.keys())
allSubsets = subsetTree.allSubsets(set(biglist))
results = {k: dictionary[k] for k in allSubsets}
assert results == {frozenset([3]): 3, frozenset([3, 4]): 34}
Bi Rico
  • Looks solid. I'd mark it as the solution, but testing it on 1.5M keys of length 2-20 gives me an estimate of roughly 10 hours of computation, and that estimate is still growing with 50K keys (3%) done. I'll try to find optimizations to complement this answer. – demon.mhm Aug 02 '18 at 08:32
  • Interesting, but would it be possible to update the tree with a new key at a later point in time? It seems like the keys have to be added in order by length. But then, I guess the worst that could happen is that there are redundant subsets in the tree. Would be interesting to have a larger test set for some performance comparisons. – tobias_k Aug 02 '18 at 11:01
0

How about encoding both the frozenset keys and the given set as bitmasks?

Then you can bitwise-AND the code of each frozenset key with the code of the given set; if the result is equal to the key's code, then the key is a subset of the given set.

With the code of the given set precomputed this way, I think it will be faster.


In this case:

dictionary = {
    frozenset([1]): 1, # a
    frozenset([2]): 2, # b
    frozenset([3]): 3, # c
    frozenset([3, 4]): 34 # d
}
biglist= [3, 4, 5] # z

a: 10000
b: 01000
c: 00100
d: 00110
z: 00111

a & z = 00000 != a = 10000, no
b & z = 00000 != b = 01000, no
c & z = 00100 == c = 00100, yes
d & z = 00110 == d = 00110, yes
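
For illustration, a rough Python sketch of this idea (element_bits and encode are just illustrative helper names); the subset test becomes key_code & query_code == key_code:

dictionary = {
    frozenset([1]): 1,
    frozenset([2]): 2,
    frozenset([3]): 3,
    frozenset([3, 4]): 34,
}
biglist = [3, 4, 5]

# assign one bit to every element that occurs in any key or in the query
universe = sorted(set().union(*dictionary.keys()) | set(biglist))
element_bits = {e: 1 << i for i, e in enumerate(universe)}

def encode(elements):
    code = 0
    for e in elements:
        code |= element_bits[e]
    return code

codes = {key: encode(key) for key in dictionary}   # precompute once per key
query = encode(biglist)                            # precompute once per query

results = {k: v for k, v in dictionary.items() if codes[k] & query == codes[k]}
assert results == {frozenset([3]): 3, frozenset([3, 4]): 34}

This still iterates over all keys; the gain is only that each subset test becomes a single integer AND on precomputed codes.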
whfuyn
  • Binary '&' gives us not a subset test but the sets' intersection. Also, it still needs to scan through all keys. Basically, your suggestion is to build a bit matrix of vocabulary occurrences and query it by dot product with a vector over the same vocabulary built from the query set, which again gives us only intersections that we have to compare with the query again to get subsets only. – demon.mhm Sep 10 '18 at 12:21
0

Being a bit generic here, I think you can use a data structure called a Bloom filter to quickly discard what is definitely not in the set. Afterwards you can do a simple scan over the possible candidates; hopefully this last step involves only a small fraction of your set.

Here is a Python implementation for the bloom filter: https://github.com/axiak/pybloomfiltermmap

Quoting their example:

>>> fruit = pybloomfilter.BloomFilter(100000, 0.1, '/tmp/words.bloom')
>>> fruit.update(('apple', 'pear', 'orange', 'apple'))
>>> len(fruit)
3
>>> 'mike' in fruit
False
>>> 'apple' in fruit
True
Leo
  • I don't think a bloom filter works here, bloom filters are good for checking equality, they don't help much for checking `issubset`. – Bi Rico Aug 01 '18 at 15:38
  • I agree it is not a good approach for this problem, but it definitely implements a set operation to test whether a given entry is a member of a set. The first line of the referenced Wikipedia article tells you. – Leo Aug 02 '18 at 13:14