Thanks for the question; I found it very interesting. I have tested the code below on Python 2.6, 2.7, and 3.3. You may find it interesting to run it yourself; I made it easy to paste into an interpreter or to run as a script.
Another answer here attempts to solve this by brute force, i.e. going through every possible combination. That may be doable for the ten elements the questioner gave as an example, but it won't offer a solution for the parameters the questioner actually asked for: choosing a combination of subsets (each up to 1500 elements long, drawn from a superset of 15000 elements) from a pool of 30,000 sets. For those parameters, attempting to find a solution set where n = 40 (very unlikely to be enough) would mean searching a number of combinations many orders of magnitude over one googol, which is quite impossible (see the PS at the end for the arithmetic).
The Setup
Here I import some modules used to benchmark my functions and create the data. I also create a timer decorator to wrap the functions so that I can easily measure how much time passes before a function completes (or until I give up and interrupt it).
import functools
import time
import logging
import random
# basic setup:
logging.basicConfig(level=logging.DEBUG) # INFO or DEBUG
random.seed(1)
PARENT_SIZE = 15000
MAX_SUBSET_SIZE = 1500
N_SUBSETS = 30000
def timer(f):
    '''
    timer wrapper modified, original obtained from:
    http://stackoverflow.com/questions/5478351/python-time-measure-function
    '''
    @functools.wraps(f)
    def wrap(*args):
        time1 = time.time()
        try:
            ret = f(*args)
        except KeyboardInterrupt:
            time2 = time.time()
            logging.info('{0} function interrupted after {1:.3f} ms'.format(
                f.__name__, (time2 - time1) * 1000.0))
        else:
            time2 = time.time()
            logging.info('{0} function took {1:.3f} ms'.format(
                f.__name__, (time2 - time1) * 1000.0))
            return ret
    return wrap
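As a quick illustration of how the decorator reports timings (a toy example of mine, not part of the benchmark itself):

@timer
def busy_sum(n):
    '''toy function to demonstrate the timer decorator'''
    return sum(range(n))

busy_sum(10 ** 7)
# logs something like: INFO:root:busy_sum function took ... ms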
Data Creation Functions
Next I have to create the data:
@timer
def make_subsets(parent_set, n):
    '''create list of subset sets, takes about 17 secs'''
    subsets = []
    for i in range(n): # use xrange in python 2
        subsets.append(set(random.sample(parent_set, random.randint(1, MAX_SUBSET_SIZE))))
    return subsets
@timer
def include_complement(parent_set, subsets):
    '''ensure no missing elements from parent, since collected randomly'''
    union_subsets = set().union(*subsets)
    subsets_complement = set(parent_set) - union_subsets
    logging.info('len of union of all subsets was {0}'.format(
        len(union_subsets)))
    if subsets_complement:
        logging.info('len of subsets_complement was {0}'.format(
            len(subsets_complement)))
        subsets.append(subsets_complement)
    return subsets
Optional Preprocessing
I provide some preprocessing below. It runs in a few seconds but doesn't help much: it only speeds the search up by a fraction of a second. It's recorded here for the reader's edification:
@timer
def remove_redundant_subsets(subsets):
    '''
    without break, takes a while, removes 81 sets of len <= 4 (seed(0))
    in 5.5 minutes, so breaking at len 10 for 4 second completion.
    probably unnecessary if truly random subsets,
    but *may* be good if large subsets are subsets of others.
    '''
    subsets.sort(key=len)
    remove_list = []
    for index, s in enumerate(subsets, 1):
        if len(s) > 10: # possible gain not worth continuing farther
            break
        if any(s.issubset(other) for other in subsets[index:]):
            logging.debug('will remove subset: {s}'.format(s=s))
            remove_list.append(s)
    logging.info('subsets removing: {0}'.format(len(remove_list)))
    for s in remove_list:
        subsets.remove(s)
    return subsets
Actual Function
Then I actually perform the Greedy Algorithm:
@timer
def greedy_set_cover(subsets, parent_set):
    parent_set = set(parent_set)
    results = []
    result_set = set()
    while result_set < parent_set:
        logging.debug('len of result_set is {0}'.format(len(result_set)))
        # maybe room for optimization here: Will still have to calculate.
        # But custom max could shortcut subsets on uncovered more than len.
        add_set = max(subsets, key=lambda x: len(x - result_set))
        logging.debug('len of add_set is {0}'.format(len(add_set)))
        results.append(add_set)
        result_set.update(add_set)
    return results
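The shortcut hinted at in the comment could look something like the sketch below (my own illustration with names I made up, not the benchmarked code): since len(x - result_set) can never exceed len(x), if the subsets are kept sorted by length in descending order, the scan can stop as soon as the remaining subsets are no longer than the best uncovered count found so far.

def best_subset(subsets_by_len, result_set):
    '''
    Expects subsets pre-sorted by len, descending. Because
    len(x - result_set) <= len(x), we can stop scanning once no
    remaining subset is long enough to beat the current best.
    '''
    best, best_uncovered = None, -1
    for x in subsets_by_len:
        if len(x) <= best_uncovered:
            break  # nothing this short (or shorter) can win
        uncovered = len(x - result_set)
        if uncovered > best_uncovered:
            best, best_uncovered = x, uncovered
    return best

To use it, one would sort once with subsets.sort(key=len, reverse=True) before the while loop and replace the max(...) call with best_subset(subsets, result_set).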
Here's the main section of the script:
# full set, use xrange instead of range in python 2 for space efficiency
parent_set = range(PARENT_SIZE)
subsets = make_subsets(parent_set, N_SUBSETS)
logging.debug(len(subsets))
subsets = include_complement(parent_set, subsets) # if necessary
logging.debug(len(subsets))
subsets = remove_redundant_subsets(subsets)
logging.debug(len(subsets))
results = greedy_set_cover(subsets, parent_set)
logging.info('len of results is {0}'.format(len(results)))
for i, subset in enumerate(results, 1): # don't shadow the set builtin
    logging.debug('len of set {0} is {1}'.format(i, len(subset)))
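As a quick sanity check (my addition, not part of the timed run), one can confirm that the greedy result really covers the parent set:

assert set().union(*results) == set(parent_set), 'cover is incomplete'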
Final Results
And this provides a final result of 46(ish) subsets in just over 3 minutes, given the original parameters the questioner gave, run in Python 2.
Here's the output for seed(0):
INFO:root:make_subsets function took 17158.725 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2716.381 ms
INFO:root:subsets removing: 81
INFO:root:remove_redundant_subsets function took 3319.620 ms
INFO:root:greedy_set_cover function took 188026.052 ms
INFO:root:len of results is 46
And here's the output for seed(1):
INFO:root:make_subsets function took 17538.083 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2414.091 ms
INFO:root:subsets removing: 68
INFO:root:remove_redundant_subsets function took 3218.643 ms
INFO:root:greedy_set_cover function took 189019.275 ms
INFO:root:len of results is 47
This was a lot of fun, thanks for the problem.
PS: I decided to try benchmarking the naive brute-force approach:
INFO:root:make_subsets function took 17984.412 ms
INFO:root:len of union of all subsets was 15000
INFO:root:include_complement function took 2412.666 ms
INFO:root:foo function interrupted after 3269064.913 ms
Naturally, I interrupted it, as it would never get close in my lifetime, perhaps not even in our Sun's lifetime:
>>> import math
>>> def combinations(n, k):
...     # floor division, so Python 3 doesn't try to make a float of this
...     return math.factorial(n) // (math.factorial(k) * math.factorial(n - k))
...
>>> combinations(30000, 40)
145180572634248196249221943251413238587572515214068555166193044430231638603286783165583063264869418810588422212955938270891601399250L
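(As an aside beyond the versions tested above: on Python 3.8+, the standard library computes this directly with math.comb, which agrees with the helper here.)

>>> math.comb(30000, 40) == combinations(30000, 40)
True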