Let's say we have the following dataset:

set.seed(144) 
dat <- matrix(rnorm(100), ncol=5)

The following expression builds a logical data frame with one row per possible combination of columns and drops the first row (the combination that selects no columns):

(cols <- do.call(expand.grid, rep(list(c(FALSE, TRUE)), ncol(dat)))[-1, ])
#     Var1  Var2  Var3  Var4  Var5
# 2   TRUE FALSE FALSE FALSE FALSE
# 3  FALSE  TRUE FALSE FALSE FALSE
# 4   TRUE  TRUE FALSE FALSE FALSE
# ...
# 31 FALSE  TRUE  TRUE  TRUE  TRUE
# 32  TRUE  TRUE  TRUE  TRUE  TRUE

My question is: how can I generate only the single, pair and triple combinations?

Selecting the rows with no more than 3 TRUE values works for this small example: cols[rowSums(cols) < 4L, ]. For datasets with many more columns, however, expand.grid itself fails before any filtering can happen, with the following error:

Error in rep.int(seq_len(nx), rep.int(rep.fac, nx)) : 
  invalid 'times' value
In addition: Warning message:
In rep.fac * nx : NAs produced by integer overflow

Any suggestions that would let me compute only the single, pair and triple combinations?
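
To make the size problem concrete, the row counts can be compared directly. This is only a rough sketch; n = 50 is an arbitrary illustrative width and the variable names are not from the original post.

```r
n <- 50                                # illustrative number of columns (assumption)

n_all      <- 2^n                      # rows expand.grid would have to create
n_selected <- sum(choose(n, 1:3))      # rows actually wanted: sizes 1, 2 and 3

n_all > .Machine$integer.max           # TRUE: more rows than an integer can index
n_selected                             # 20825
```

The full grid grows as 2^n, while the number of combinations of size 1 to 3 grows only polynomially, which is why generating everything and filtering afterwards stops working for wide data.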

2 Answers

You could try either

cols[rowSums(cols) < 4L, ]

Or

cols[Reduce(`+`, cols) < 4L, ]
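
As a quick check, the two filters can be compared on the 5-column example. This is a minimal sketch that rebuilds cols from scratch; the names a and b are only illustrative.

```r
# Rebuild the full grid for 5 columns and drop the all-FALSE row
cols <- do.call(expand.grid, rep(list(c(FALSE, TRUE)), 5))[-1, ]

a <- cols[rowSums(cols) < 4L, ]        # row-wise count of TRUE values
b <- cols[Reduce(`+`, cols) < 4L, ]    # same count, adding the columns one by one

nrow(a)                                # 25 = 5 singles + 10 pairs + 10 triples
all(a == b)                            # TRUE
```

Both variants still need the full grid from expand.grid, so they share its size limit.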
David Arenburg
  • Thanks David, it works! Is there any other way of doing this without calculating all combinations, to save some time? – Ceyda Oksel Apr 29 '15 at 09:48
  • You could do something like `library(gtools) ; permutations(2, 5, c(TRUE, FALSE), repeats.allowed = TRUE)[-c(1, 31:32), ]` but you'll probably have to make it a bit more robust – David Arenburg Apr 29 '15 at 10:07
  • It works perfectly for small-sized vectors but I am getting the following error when I work with a larger vector (>50 variables): "Error in rep.int(seq_len(nx), rep.int(rep.fac, nx)) : invalid 'times' value In addition: Warning message: In rep.fac * nx : NAs produced by integer overflow" I guess the problem is in expand.grid with very long vectors. I should find a way of calculating limited number of combinations only rather than creating all and selecting a portion. – Ceyda Oksel Apr 29 '15 at 10:21
  • Yes, it's not a robust approach. I don't have time to look into it right now. Maybe later. – David Arenburg Apr 29 '15 at 10:22

You can use this solution:

col.i <- do.call(c, lapply(1:3, combn, x = 5, simplify = FALSE))
# [[1]]
# [1] 1
# 
# [[2]]
# [1] 2
# 
# <...skipped...>
# 
# [[24]]
# [1] 2 4 5
# 
# [[25]]
# [1] 3 4 5

Here, col.i is a list in which every element holds the column indices of one combination.

How it works: combn generates all combinations of the numbers from 1 to 5 (requested by x=5) taken m at a time; simplify=FALSE makes it return a list rather than a matrix. lapply loops over m from 1 to 3 and returns a list of lists, and do.call(c, ...) flattens that into a single plain list.
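
The flattening step can be seen on a smaller case. This is a minimal sketch with 3 items and sizes 1 to 2; the names nested and flat are only illustrative.

```r
# One inner list per combination size (m = 1, then m = 2)
nested <- lapply(1:2, combn, x = 3, simplify = FALSE)
length(nested)      # 2
lengths(nested)     # 3 3

# Concatenate the inner lists into one flat list of index vectors
flat <- do.call(c, nested)
length(flat)        # 6
flat[[4]]           # 1 2  (the first pair, after the three singles)
```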

You can use col.i to pull the corresponding columns from dat, e.g. dat[,col.i[[1]],drop=FALSE]. Here 1 is the index of the column combination, so any number from 1 to 25 works; drop=FALSE makes sure that when a combination has just one column, the result stays a matrix instead of being simplified to a vector, which might cause unexpected program behavior. Another option is to use lapply, e.g.

lapply(col.i, function(cols) dat[, cols, drop = FALSE])

which returns a list of matrices, each containing one subset of the columns of dat.
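
The effect of drop = FALSE on a single-column selection can be checked directly. This is a small sketch using the dataset from the question; the names one_col_vec and one_col_mat are only illustrative.

```r
set.seed(144)
dat <- matrix(rnorm(100), ncol = 5)     # 20 rows, 5 columns

one_col_vec <- dat[, 1]                 # simplified to a plain numeric vector
one_col_mat <- dat[, 1, drop = FALSE]   # stays a 20 x 1 matrix

is.null(dim(one_col_vec))               # TRUE: no dimensions left
dim(one_col_mat)                        # 20 1
```

Code that later calls ncol() or indexes by column would behave differently on the first result, which is the kind of surprise drop = FALSE avoids.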

If you want the column indices as a logical matrix instead, you can use:

col.b <- t(sapply(col.i,function(z) 1:5 %in% z))
#       [,1]  [,2]  [,3]  [,4]  [,5]
# [1,]  TRUE FALSE FALSE FALSE FALSE
# [2,] FALSE  TRUE FALSE FALSE FALSE
# [3,] FALSE FALSE  TRUE FALSE FALSE
# ...
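
A row of this logical matrix can be used directly as a column selector. This is a minimal sketch that rebuilds col.i and col.b for the 5-column example; the name sizes is only illustrative.

```r
set.seed(144)
dat <- matrix(rnorm(100), ncol = 5)

col.i <- do.call(c, lapply(1:3, combn, x = 5, simplify = FALSE))
col.b <- t(sapply(col.i, function(z) 1:5 %in% z))

dim(col.b)                             # 25 5
sizes <- rowSums(col.b)                # number of columns in each combination
as.vector(table(sizes))                # 5 10 10

which(col.b[25, ])                     # 3 4 5
sub <- dat[, col.b[25, ], drop = FALSE]
dim(sub)                               # 20 3
```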

[UPDATE]

A more efficient implementation:

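library("gRbase")  # provides combnPrim, used here in place of combn

coli <- function(x = 5, m = 3) {
    # All combinations of 1..x of sizes 1 to m, as a flat list of index vectors
    col.i <- do.call(c, lapply(1:m, combnPrim, x = x, simplify = FALSE))

    # Position of each selected column in a row-major vector (x entries per combination)
    z <- lapply(seq_along(col.i), function(i) x * (i - 1) + col.i[[i]])
    v.b <- rep(FALSE, x * length(col.i))
    v.b[unlist(z)] <- TRUE
    # One row per combination, one column per variable
    matrix(v.b, ncol = x, byrow = TRUE)
}

coli(70,5) # takes about 30 sec on my desktop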
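
For readers without gRbase, the same kind of matrix can be built with base R only. The sketch below is not the answer's code: coli_base is a hypothetical helper that swaps in combn and two-column matrix indexing, and it makes no claim about matching the speed of the version above.

```r
# Hypothetical base-R variant (assumption, not from the answer)
coli_base <- function(x = 5, m = 3) {
  # Flat list of index vectors for combination sizes 1..m
  col.i <- do.call(c, lapply(1:m, combn, x = x, simplify = FALSE))
  # Row number of each combination, repeated once per selected column
  rows <- rep(seq_along(col.i), lengths(col.i))
  out <- matrix(FALSE, nrow = length(col.i), ncol = x)
  out[cbind(rows, unlist(col.i))] <- TRUE   # set (row, column) pairs to TRUE
  out
}

res <- coli_base(5, 3)
dim(res)                               # 25 5

# Size of the result for 70 variables and up to 5 columns
n_rows <- sum(choose(70, 1:5))         # 13077134
n_rows * 70 * 4 / 2^30                 # roughly 3.4 (GiB, at 4 bytes per logical)
```

The last two lines suggest that for 70 variables the result itself is large, so memory is likely to matter as much as the speed of generating the combinations.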
Marat Talipov
  • I haven't checked but I'm sure it works like a charm – David Arenburg Apr 30 '15 at 08:24
  • Dear Marat, thanks a lot! The code works and does what I want. However, when I try it with larger data (70 variables) and attempt to compute all possible combinations of up to 5 variables using your code in the following form `col.i<- do.call(c,lapply(1:5,combnPrim,x=70,simplify=F)) lapply(col.i, function(cols) data[,cols]) col.b <- t(sapply(col.i,function(z) 1:70 %in% z))` , R stops working after a while. I've tried using the `combnPrim` function from the gRbase package instead of `combn`, as it is supposed to be faster, but it didn't help. Any thoughts on how to fix it? – Ceyda Oksel Apr 30 '15 at 11:55
  • `lapply` is basically just a loop, so there is not much room for improvement here. You can try a multi-core parallel version of `lapply` though (see http://stat.ethz.ch/R-manual/R-patched/library/parallel/html/mclapply.html). – Marat Talipov Apr 30 '15 at 15:26
  • BTW I updated the answer, is it still too slow for your purposes? – Marat Talipov Apr 30 '15 at 15:28
  • The updated version solved all of my problems! I truly can't thank you enough. – Ceyda Oksel Apr 30 '15 at 16:41