Thanks for the question; I found it very interesting. I have tested the code below on Python 2.6, 2.7, and 3.3. You may find it interesting to run it yourself; I made it easy to paste into an interpreter or to run as a script.
Another answer here attempts to solve this by brute force, i.e. going through every possible combination. That may be doable for the ten elements the questioner gave as an example, but it won't offer a solution for the parameters the questioner actually asked for: choosing a combination of subsets (each up to 1500 elements long, drawn from a superset of 15000 elements) from a pool of 30,000 sets. For those parameters, attempting to find a solution set where n = 40 (very unlikely to be enough) would mean searching a number of combinations many orders of magnitude over one googol, which is quite impossible (see the PS at the end for the arithmetic).
The Setup
Here I import some modules used to benchmark my functions and create the data. I also create a timer decorator to wrap the functions so that I can easily measure how much time passes before a function completes (or until I give up and interrupt it).
import functools
import time
import logging
import random
# basic setup:
logging.basicConfig(level=logging.DEBUG) # INFO or DEBUG
random.seed(1)
PARENT_SIZE = 15000
MAX_SUBSET_SIZE = 1500
N_SUBSETS = 30000
def timer(f):
    '''
    timer wrapper modified, original obtained from:
    http://stackoverflow.com/questions/5478351/python-time-measure-function
    '''
    @functools.wraps(f)
    def wrap(*args):
        time1 = time.time()
        try:
            ret = f(*args)
        except KeyboardInterrupt:
            time2 = time.time()
            logging.info('{0} function interrupted after {1:.3f} ms'.format(
                f.__name__, (time2 - time1) * 1000.0))
        else:
            time2 = time.time()
            logging.info('{0} function took {1:.3f} ms'.format(
                f.__name__, (time2 - time1) * 1000.0))
            return ret
    return wrap
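As a quick illustration of how the decorator reports timings (a toy example of mine, not part of the benchmark itself):

@timer
def busy_sum(n):
    '''toy function to demonstrate the timer decorator'''
    return sum(range(n))

busy_sum(10 ** 7)
# logs something like: INFO:root:busy_sum function took ... ms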
Data Creation Functions
Next I have to create the data:
@timer
def make_subsets(parent_set, n):
    '''create list of subset sets, takes about 17 secs'''
    subsets = []
    for i in range(n): # use xrange in python 2
        subsets.append(set(random.sample(parent_set, random.randint(1, MAX_SUBSET_SIZE))))
    return subsets
@timer
def include_complement(parent_set, subsets):
    '''ensure no missing elements from parent, since collected randomly'''
    union_subsets = set().union(*subsets)
    subsets_complement = set(parent_set) - union_subsets
    logging.info('len of union of all subsets was {0}'.format(
        len(union_subsets)))
    if subsets_complement:
        logging.info('len of subsets_complement was {0}'.format(
            len(subsets_complement)))
        subsets.append(subsets_complement)
    return subsets
Optional Preprocessing
I provide some preprocessing below. It runs in a few seconds but doesn't help much: it only speeds the search up by a fraction of a second. It's recorded here for the reader's edification:
@timer
def remove_redundant_subsets(subsets):
    '''
    without break, takes a while, removes 81 sets of len <= 4 (seed(0))
    in 5.5 minutes, so breaking at len 10 for 4 second completion.
    probably unnecessary if truly random subsets,
    but *may* be good if large subsets are subsets of others.
    '''
    subsets.sort(key=len)
    remove_list = []
    for index, s in enumerate(subsets, 1):
        if len(s) > 10: # possible gain not worth continuing farther
            break
        if any(s.issubset(other) for other in subsets[index:]):
            logging.debug('will remove subset: {s}'.format(s=s))
            remove_list.append(s)
    logging.info('subsets removing: {0}'.format(len(remove_list)))
    for s in remove_list:
        subsets.remove(s)
    return subsets
Actual Function
Then I actually perform the Greedy Algorithm:
@timer
def greedy_set_cover(subsets, parent_set):
    parent_set = set(parent_set)
    results = []
    result_set = set()
    while result_set < parent_set:
        logging.debug('len of result_set is {0}'.format(len(result_set)))
        # maybe room for optimization here: Will still have to calculate.
        # But custom max could shortcut subsets on uncovered more than len.
        add_set = max(subsets, key=lambda x: len(x - result_set))
        logging.debug('len of add_set is {0}'.format(len(add_set)))
        results.append(add_set)
        result_set.update(add_set)
    return results
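The shortcut hinted at in the comment could look something like the sketch below (my own illustration with names I made up, not the benchmarked code): since len(x - result_set) can never exceed len(x), if the subsets are kept sorted by length in descending order, the scan can stop as soon as the remaining subsets are no longer than the best uncovered count found so far.

def best_subset(subsets_by_len, result_set):
    '''
    Expects subsets pre-sorted by len, descending. Because
    len(x - result_set) <= len(x), we can stop scanning once no
    remaining subset is long enough to beat the current best.
    '''
    best, best_uncovered = None, -1
    for x in subsets_by_len:
        if len(x) <= best_uncovered:
            break  # nothing this short (or shorter) can win
        uncovered = len(x - result_set)
        if uncovered > best_uncovered:
            best, best_uncovered = x, uncovered
    return best

To use it, one would sort once with subsets.sort(key=len, reverse=True) before the while loop and replace the max(...) call with best_subset(subsets, result_set).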
Here's the main section of the script:
# full set, use xrange instead of range in python 2 for space efficiency
parent_set = range(PARENT_SIZE)
subsets = make_subsets(parent_set, N_SUBSETS)
logging.debug(len(subsets))
subsets = include_complement(parent_set, subsets) # if necessary
logging.debug(len(subsets))
subsets = remove_redundant_subsets(subsets)
logging.debug(len(subsets))
results = greedy_set_cover(subsets, parent_set)
logging.info('len of results is {0}'.format(len(results)))
for i, subset in enumerate(results, 1): # don't shadow the set builtin
    logging.debug('len of set {0} is {1}'.format(i, len(subset)))
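As a quick sanity check (my addition, not part of the timed run), one can confirm that the greedy result really covers the parent set:

assert set().union(*results) == set(parent_set), 'cover is incomplete'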
Final Results
And this provides a final result of 46(ish) subsets in just over 3 minutes, given the original parameters the questioner gave, run in Python 2.
Here's the output for seed(0):
INFO:root:make_subsets function took 17158.725 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2716.381 ms
INFO:root:subsets removing: 81
INFO:root:remove_redundant_subsets function took 3319.620 ms
INFO:root:greedy_set_cover function took 188026.052 ms
INFO:root:len of results is 46
And here's the output for seed(1):
INFO:root:make_subsets function took 17538.083 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2414.091 ms
INFO:root:subsets removing: 68
INFO:root:remove_redundant_subsets function took 3218.643 ms
INFO:root:greedy_set_cover function took 189019.275 ms
INFO:root:len of results is 47
This was a lot of fun, thanks for the problem.
PS: I decided to try benchmarking the naive brute-force approach:
INFO:root:make_subsets function took 17984.412 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2412.666 ms
INFO:root:foo function interrupted after 3269064.913 ms
Naturally, I interrupted it, as it would never get close in my lifetime, perhaps not even in our Sun's lifetime:
>>> import math
>>> def combinations(n, k):
...     # floor division, so Python 3 doesn't try to make a float of this
...     return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
...
>>> combinations(30000, 40)
145180572634248196249221943251413238587572515214068555166193044430231638603286783165583063264869418810588422212955938270891601399250L
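(As an aside beyond the versions tested above: on Python 3.8+, the standard library computes this directly with math.comb, which agrees with the helper here.)

>>> math.comb(30000, 40) == combinations(30000, 40)
True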