
This question extends this post. It relates to a machine-learning feature-selection procedure where I have a large matrix of features and I'd like to perform a fast and crude feature selection by measuring the correlation between the outer product of each pair of features and the response, since I'll be using a random forest or boosting classifier.

The number of features is ~60,000 and the number of responses is ~2,200,000.

Given unlimited memory, perhaps the fastest approach would be to generate a matrix whose columns are the outer products of all pairs of features and run cor on that matrix against the response. A smaller-dimension example:

set.seed(1)
feature.mat <- matrix(rnorm(2200*100),nrow=2200,ncol=100)
response.vec <- rnorm(2200)

#generate indices of all unique pairs of features and get the outer products:
feature.pairs <- t(combn(1:ncol(feature.mat),2))
feature.pairs.prod <- feature.mat[,feature.pairs[,1]]*feature.mat[,feature.pairs[,2]]

#compute the correlation coefficients
res <- cor(feature.pairs.prod,response.vec)

But for my real dimensions feature.pairs.prod is 2,200,000 by 1,799,970,000, which obviously cannot be stored in memory.
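
For a rough sense of scale, storing that matrix as doubles would take on the order of:

2200000 * 1799970000 * 8 / 1024^3  ## roughly 29.5 million GB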

So my question is whether, and how, it is possible to get all the correlations in a reasonable computation time.

I was thinking that perhaps breaking feature.pairs.prod down into chunks that fit in memory and then running cor between each chunk and response.vec one at a time would be the fastest, but I'm not sure how to automatically work out in R what size these chunks need to be.

Another option is to apply a function over feature.pairs that computes the product of each pair and then its correlation with response.vec.
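
A minimal sketch of that second option (for illustration only; looping over ~1.8 billion pairs one at a time in R is likely to be very slow):

res2 <- apply(feature.pairs, 1, function(p)
  cor(feature.mat[, p[1]] * feature.mat[, p[2]], response.vec))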

Any suggestions?

  • It's pretty similar in the sense that the most reasonable solution is my first suggestion of breaking the feature.pairs.prod matrix down into chunks and looping over them. Is there an R way to figure out the chunk size from my system's resources, given feature.mat, though? – dan Oct 20 '16 at 17:22

1 Answer


Yes, chunk-wise computation is the way to go. A similar approach is used in Out of memory when using outer in solving my big normal equation for least squares estimation.

The first steps stay the same:

set.seed(1)
feature.mat <- matrix(rnorm(2200*100),nrow=2200,ncol=100)
response.vec <- rnorm(2200)

#generate indices of all unique pairs of features and get the outer products:
feature.pairs <- t(combn(1:ncol(feature.mat),2))
j1 <- feature.pairs[,1]
j2 <- feature.pairs[,2]

But then we need to break j1 and j2 into chunks:

## number of data
n <- nrow(feature.mat)
## set a chunk size
k <- 1000
## start and end index of each chunk
start <- seq(1, length(j1), by = k)
end <- c(start[-1] - 1, length(j1))

## result for the i-th chunk
chunk_cor <- function (i) {
  jj <- start[i]:end[i]
  jj1 <- j1[jj]; jj2 <- j2[jj]
  feature.pairs.prod <- feature.mat[, jj1] * feature.mat[, jj2]
  cor(feature.pairs.prod, response.vec)
}

## now we loop through all chunks and combine the results
res <- unlist(lapply(seq_along(start), chunk_cor))
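
On this small example we can sanity-check that the chunked result matches the all-at-once computation from the question (feasible only at this toy size):

full <- cor(feature.mat[, j1] * feature.mat[, j2], response.vec)
all.equal(res, as.vector(full))  ## should be TRUE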

The major problem is how to decide k.

As demonstrated in the linked answer, we can calculate the memory footprint. If you have n rows and a chunk of k columns, the memory cost of an n * k matrix of doubles is n * k * 8 / 1024 / 1024 / 1024 GB. You can set a memory budget up front; then, since n is known, you can solve for k.

Check the memory costs inside chunk_cor: feature.mat[,jj1], feature.mat[,jj2] and feature.pairs.prod all need to be generated and stored, so the memory use is roughly:

3 * n * k * 8 / 1024 / 1024 / 1024 GB

Now suppose we want to keep the memory footprint under 4 GB. Since n is known, we can solve for k:

k <- floor(4 * 2^30 / (24 * n))
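
For example, plugging in the row count from the question (n = 2,200,000) under that 4 GB budget:

n <- 2200000                     ## rows in the real data
k <- floor(4 * 2^30 / (24 * n))  ## gives k = 81 columns per chunk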