4

I'm trying to solve or implement an approximation to the set cover problem in R. Given a data frame like this.

  sets n
1   s1 1
2   s1 2
3   s1 3
4   s2 2
5   s2 4
6   s3 3
7   s3 4
8   s4 4
9   s4 5

The unique number of elements in column n are:

unique(d$n)
[1] 1 2 3 4 5

I'd like to calculate the smaller number of sets (column sets) that cover all the unique elements in n (universe). In this example two sets: s1 {1, 2, 3} and s4 {4, 5}. I've read about it in wikipedia and on the Internet, and I know a greedy algorithm could be applied to find an approximation. I've checked too this link in which they mention two packages to solve such problems, LPsolve and Rsymphony, but I don't know even how to start. In my real life example, I have more than 40,000 sets and each set between 1,000 and 10,000 elements and a unviverse or unique elements of 80,000.

Any help or guide about how to start or proceed will be very much appreciated.

data

d <- structure(list(sets = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 
4L, 4L), .Label = c("s1", "s2", "s3", "s4"), class = "factor"), 
    n = c(1, 2, 3, 2, 4, 3, 4, 4, 5)), .Names = c("sets", "n"
), row.names = c(NA, -9L), class = "data.frame")
IRTFM
  • 258,963
  • 21
  • 364
  • 487
mpalanco
  • 12,960
  • 2
  • 59
  • 67
  • 2
    This problem is known to be NP complete, and arises in big data problems e.g., in computational biology. Use a greedy algorithm to construct a cover as follows, and has the interpretation 'Select the set with the most uncovered elements at each step': 1) Select the set with the maximum number of elements of the original set. 2) Remove this set as well as all of its elements from the remaining other subsets. 3) Recurse. – shayaa Aug 27 '16 at 18:55
  • @shayaa I've read in the wikipedia and in other pages about the greedy algorithm approach you mentioned, but I don't know how to implement it. I've seen some [python implementation](http://stackoverflow.com/questions/7942312/how-do-i-make-my-implementation-of-greedy-set-cover-faster). It seems strange that I cannot find any R example of it since I assume it's a common problem. If you happen to know an example or link to it it'd be great. Thank you – mpalanco Sep 02 '16 at 07:12

1 Answers1

3

The lpSolve package is available on CRAN for linear programing problems. Using your link which had a response from the very reputable Hans Borchers, as well as a slightly more complex example (starting on pg 4/5) in http://math.mit.edu/~goemans/18434S06/setcover-tamara.pdf as templates to understand teh proper structure of the setup, and then following along with modifications to the first example in ?lp:

library( lpSolve)
?lp
# In Details: "Note that every variable is assumed to be >= 0!"
# go from your long-form rep of the sets to a wide form for a matrix representation
( items.mat<- t(table(d$sets,d$n))  )  # could have reversed order of args to skip t()
#---------
> dimnames(items.mat) = list( items=1:5, sets=paste0("s", 1:4) )
> items.mat
     sets
items s1 s2 s3 s4
    1  1  0  0  0
    2  1  1  0  0
    3  1  0  1  0
    4  0  1  1  1
    5  0  0  0  1
#---------
f.obj <-  rep(1,4)  # starting values of objective parameters by column (to be solved)
f.dir <- rep(">=",5) # the constraint "directions" by row
f.rhs <- rep(1,5)    # the inequality values by row (require all items to be present)

lp ("min", f.obj, items.mat, f.dir, f.rhs)$solution
#[1] 1 0 0 1

So sets s1 and s4 are a minimal cover. The "column coefficients" determine choice of "sets".

IRTFM
  • 258,963
  • 21
  • 364
  • 487
  • Thank you very much – mpalanco Aug 28 '16 at 20:50
  • I'd be interested in hearing about how this worked on a problem that was not so "obvious". Was the performance satisfactory? – IRTFM Aug 29 '16 at 01:29
  • 1
    You solution helped me because, you solved the example I asked, so I tried other examples and then on my larger set. Secondly, the links and the lp help you pointed out, gave me a better understanding about the problem. I terms of performance I couldn’t solve one of the cases due to memory problems; the object size exceeded my RAM limit. I tried in a computer with a larger RAM and I could solve bigger matrices. Probably I was a trying a matrix too big. Even if this code could be optimized, I doubt I could handle such matrices without a more powerful RAM. Thank you! – mpalanco Aug 29 '16 at 21:17