R: Find set of columns which contain most 1s in matrix of 0 and 1

Question

I have a matrix of 1s and 0s where the rows are individuals and the columns are events. A 1 indicates that an event happened to an individual and a 0 that it did not.

I want to find which set of (in the example) 5 columns/events that cover the most rows/individuals.

Test Data

#Make test data
set.seed(123)
d <- sapply(1:300, function(x) sample(c(0,1), 30, T, c(0.9,0.1)))
colnames(d) <- 1:300
rownames(d) <- 1:30

My attempt

My initial attempt was just based on combining the set of 5 columns with the highest colMeans:

#Get top 5 columns with highest row coverage
col_set <- head(sort(colMeans(d), decreasing = T), 5)

#Have a look the set
col_set

>
      197       199        59        80        76 
0.2666667 0.2666667 0.2333333 0.2333333 0.2000000

#Check row coverage of the column set
sum(apply(d[,colnames(d) %in% names(col_set)], 1, sum) > 0) / 30 #top 5

>
[1] 0.7

However this set does not cover the most rows. I tested this by pseudo-random sampling 10.000 different sets of 5 columns, and then finding the set with the highest coverage:

#Get 5 random columns using colMeans as prob in sample
##Random sample 10.000 times
set.seed(123)
result <- lapply(1:10000, function(x){
  col_set2 <- sample(colMeans(d), 5, F, colMeans(d))
  cover <- sum(apply(d[,colnames(d) %in% names(col_set2)], 1, sum) > 0) / 30 #random 5
  list(set = col_set2, cover = cover)
})

##Have a look at the best set
result[which.max(sapply(result, function(x) x[["cover"]]))]

>
[[1]]
[[1]]$set
        59        169        262         68        197 
0.23333333 0.10000000 0.06666667 0.16666667 0.26666667 

[[1]]$cover
[1] 0.7666667

The reason for supplying the colMeans to sample is that the columns with the highest coverages are the ones I am most interested in.

So, using pseudo-random sampling I can collect a set of columns with higher coverage than when just using the top 5 columns. However, since my actual data sets are larger than the example I am looking for a more efficient and rational way of finding the set of columns with the highest coverage.

EDIT

For the interested, I decided to microbenchmark the 3 solutions provided:

#Defining G. Grothendieck's coverage funciton outside his solutions
coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30

#G. Grothendieck top solution
solution1 <- function(d){
  cols <- tail(as.numeric(names(sort(colSums(d)))), 20)
  co <- combn(cols, 5)
  itop <- which.max(apply(co, 2, coverage))
  co[, itop]
}

#G. Grothendieck "Older solution"
solution2 <- function(d){
  require(lpSolve)
  ones <- rep(1, 300)
  res <- lp("max", colSums(d), t(ones), "<=", 5, all.bin = TRUE, num.bin.solns = 10)
  m <- matrix(res$solution[1:3000] == 1, 300)
  cols <- which(rowSums(m) > 0)
  co <- combn(cols, 5)
  itop <- which.max(apply(co, 2, coverage))
  co[, itop]
}

#user2554330 solution
bestCols <- function(d, n = 5) {
  result <- numeric(n)
  for (i in seq_len(n)) {
    result[i] <- which.max(colMeans(d))
    d <- d[d[,result[i]] != 1,, drop = FALSE]
  }
  result
}

#Benchmarking...
microbenchmark::microbenchmark(solution1 = solution1(d),
                               solution2 = solution2(d),
                               solution3 = bestCols(d), times = 10)

>
Unit: microseconds
      expr        min         lq       mean      median         uq       max neval
 solution1 390811.850 497155.887 549314.385 578686.3475 607291.286 651093.16    10
 solution2  55252.890  71492.781  84613.301  84811.7210  93916.544 117451.35    10
 solution3    425.922    517.843   3087.758    589.3145    641.551  25742.11    10

Hi, does this answer your question: https://stackoverflow.com/q/57534107/6574038 ? — jay.sf, Aug 27 '20 at 16:33
I figured I could not have been the only one faced with this problem @jay.sf . The question you asked actually reflects how I originally wanted to solve my problem. In my case I still only wanted to find the minimum number of columns - keeping all the rows. However, your question allows for minimising both the number of rows and columns simultaneously. This is an interesting extra level of complexity! — embacify, Aug 28 '20 at 09:12
Regarding the benchmark, in solution1 the speed is actually a function of how many top columns, N, you use. The lower that N is the faster it runs but the higher that N is the better the solution. When I tried it for N = 5, 10, 15, 20 it took 1x, 13x, 140x and 721x the time of N=5 on my PC based on the median time column and gave coverages of 83.3%, 86.7%, 90% and 93.3% — G. Grothendieck, Aug 28 '20 at 11:45

score 3 · Answer 1 · answered Aug 27 '20 at 16:12

This looks like a relatively hard optimization problem, because of the ways columns interact. An approximate strategy would be to pick the column with the highest mean; then delete the rows with ones in that column, and repeat. You won't necessarily find the best solution this way, but you should get a fairly good one.

For example,

set.seed(123)
d <- sapply(1:300, function(x) sample(c(0,1), 30, T, c(0.9,0.1)))
colnames(d) <- 1:300
rownames(d) <- 1:30
bestCols <- function(d, n = 5) {
  result <- numeric(n)
  for (i in seq_len(n)) {
    result[i] <- which.max(colMeans(d))
    d <- d[d[,result[i]] != 1,, drop = FALSE]
  }
  cat("final dim is ", dim(d))
  result
}
col_set <- bestCols(d)
sum(apply(d[,colnames(d) %in% col_set], 1, sum) > 0) / 30 #top 5

This gives 90% coverage.

This is an interesting and quite elegant solution to the problem! However, I ended up accepting @G. Grothendieck answer because his brute force solutions had higher coverage. — embacify, Aug 28 '20 at 09:16

G. Grothendieck · Accepted Answer · 2020-09-01T12:13:38.517

The following provides a heuristic to find an approximate solution. Find the N=20 columns, say, with the most ones, cols, and then use brute force to find every subset of 5 columns out of those 20. The subset having the highest coverage is shown below and its coverage is 93.3%.

coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30

N <- 20
cols <- tail(as.numeric(names(sort(colSums(d)))), N)

co <- combn(cols, 5)
itop <- which.max(apply(co, 2, coverage))
co[, itop]
## [1]  90 123 197 199 286

coverage(co[, itop])
## [1] 0.9333333

Repeating this for N=5, 10, 15 and 20 we get coverages of 83.3%, 86.7%, 90% and 93.3%. The higher the N the better the coverage but the lower the N the less the run time.

Older solution

We can approximate the problem with a knapsack problem that chooses the 5 columns with largest numbers of ones using integer linear programming.
We get the 10 best solutions to this approximate problem, get all columns which are in at least one of the 10 solutions. There are 14 such columns and we then use brute force to find which subset of 5 of the 14 columns has highest coverage.

library(lpSolve)

ones <- rep(1, 300)
res <- lp("max", colSums(d), t(ones), "<=", 5, all.bin = TRUE, num.bin.solns = 10)

coverage <- function(ix) sum(rowSums(d[, ix]) > 0) / 30

# each column of m is logical 300-vector defining possible soln
m <- matrix(res$solution[1:3000] == 1, 300)

# cols is the set of columns which are in any of the 10 solutions
cols <- which(rowSums(m) > 0)
length(cols)
## [1] 14

# use brute force to find the 5 best columns among cols
co <- combn(cols, 5)
itop <- which.max(apply(co, 2, coverage))
co[, itop]
## [1]  90 123 197 199 286
coverage(co[, itop])
## [1] 0.9333333

GKi · Answer 3 · 2020-09-01T14:29:32.557

You can try to test if there is a better column and exchange this with the one currently in the selection.

n <- 5 #Number of columns / events
i <- rep(1, n)
for(k in 1:10) { #How many times itterate
  tt <- i
  for(j in seq_along(i)) {
    x <- +(rowSums(d[,i[-j]]) > 0)
    i[j] <- which.max(colSums(x == 0 & d == 1))
  }
  if(identical(tt, i)) break
}
sort(i)
#[1]  90 123 197 199 286
mean(rowSums(d[,i]) > 0)
#[1] 0.9333333

Taking into account, that the initial condition influences the result you can take random starts.

n <- 5 #Number of columns / events
x <- apply(d, 2, function(x) colSums(x == 0 & d == 1))
diag(x) <- -1
idx <- which(!apply(x==0, 1, any))
x <- apply(d, 2, function(x) colSums(x != d))
diag(x) <- -1
x[upper.tri(x)] <- -1
idx <- unname(c(idx, which(apply(x==0, 1, any))))
res <- sample(idx, n)
for(l in 1:100) {
  i <- sample(idx, n)
  for(k in 1:10) { #How many times itterate
    tt <- i
    for(j in seq_along(i)) {
      x <- +(rowSums(d[,i[-j]]) > 0)
      i[j] <- which.max(colSums(x == 0 & d == 1))
    }
    if(identical(tt, i)) break
  }
  if(sum(rowSums(d[,i]) > 0) > sum(rowSums(d[,res]) > 0)) res  <- i
}
sort(res)
#[1]  90 123 197 199 286
mean(rowSums(d[,res]) > 0)
#[1] 0.9333333

R: Find set of columns which contain most 1s in matrix of 0 and 1

3 Answers3

Older solution