8

This question follows from a related question of mine posted here. @mhum suggested that my problem falls into the covering problem domain. I tried encoding my question into a minimum set cover problem and currently I have a dataset in this form:

Set        Cost
(1,2)        1
(1)          1
(1,2,3)      2
(1)          2
(3,4)        2
(4)          3
(1,2)        3
(3,4)        4
(1,2,3,4)    4

The objective is to find a good set cover that covers all numbers and one that attempts to minimize the total cost. My dataset is big with at least 30000 sets (of size varying from 5-40 elements) like this. Are there any good greedy implementations to solve this or am I on my own to implement this? I am not an expert in LP but any LP-solvers (from numpy/scipy) that can solve this are also acceptable.

Community
  • 1
  • 1
Legend
  • 113,822
  • 119
  • 272
  • 400

2 Answers2

9

There is a well-known greedy approximation algorithm for set cover that is also easy to implement in whatever language of your choice. The algorithm itself is described here:

http://en.wikipedia.org/wiki/Set_cover_problem#Greedy_algorithm

It is so simple that the easiest thing is just to write it from scratch.

Notably, it is also the best polynomial-time approximation algorithm known for set cover. That means that to get better worst-case performance (more compact result set) you would need to have non-polynomial running times (= slow algorithms for large sets).

Unfortunately the Wikipedia entry doesn't actually cover weighted set cover, which is the case here. The extension is simple, and is described e.g. here:

http://pages.cs.wisc.edu/~shuchi/courses/880-S07/scribe-notes/lecture03.pdf

Some more useful notes:

http://www.cs.ucr.edu/~neal/non_arxiv/Young08SetCover.pdf http://www.cs.uiuc.edu/class/sp08/cs473/Lectures/lec20.pdf

Legend
  • 113,822
  • 119
  • 272
  • 400
Antti Huima
  • 25,136
  • 3
  • 52
  • 71
  • +1 Thank you. Pardon my ignorance. I have one more question. I like the greedy version but I guess I am missing the cost version i.e. at least in my case, choosing (1,2) and (3,4) independently will give me a cost of 3 whereas choosing (1,2,3,4) will give me 4. I was hoping a cost-aware greedy set cover would take this into account but is the extension as simple as: "at each stage pick the one with maximum intersection **and** lowest cost"? – Legend Oct 29 '11 at 00:52
  • Yes, sorry, I missed the fact that the Wikipedia page actually didn't cover the weighted version, I thought it would have because it's kind of standard. Anyway, added a link to lecture notes that explain the weighted version also (weighted = with costs). You basically divide the cost of a new set to add with the number of elements it covers, and always pick the set with highest efficiency (lowest cost per covered elements). It's still provably good approximation algorithm. – Antti Huima Oct 29 '11 at 01:08
  • Thank you very much. I finished implementing it and posted it here: http://stackoverflow.com/questions/7942312/how-do-i-make-my-implementation-of-greedy-set-cover-faster for suggestions on how to make it faster implementation-wise. I also added some more links to the answer. Once again, thank you for your help :) – Legend Oct 29 '11 at 23:24
  • My sets are going to have a few tens of items, at most, is there an exact algorithm for the weighted set cover problem? – Stavros Korokithakis Jul 19 '13 at 12:21
  • @StavrosKorokithakis Yes, branch and bound. – Antti Huima Jul 19 '13 at 15:19
3

My linear time / space implementation of greedy set cover in c++ is available at github.

https://github.com/martin-steinegger/setcover

A calculation for 40.000.000 sets with avg. 10 elements per set takes around 4 Minutes on computed on Amazon AWS m2.2xlarge instances.

I still work on some tricks to improve the performance

  1. remove subsets that are covered by a bigger set with MinHash
  2. remove all sets that just contain one element that is no other set
martin s
  • 1,121
  • 1
  • 12
  • 29
  • Wouldn't removing subsets that are covered by a bigger set not be a valid operation in the weighted set cover? IE the smaller subsets could minimize the cost. I'm asking out of curiousity, not suggesting this is so. – Adam Hughes Feb 04 '15 at 17:23
  • @AdamHughes sorry for the delayed response. It should be possible to remove weighted sets that are fully covered by a bigger set. – martin s Mar 06 '15 at 10:28