This question extends this post and relates to a machine-learning feature-selection procedure: I have a large matrix of features and I'd like to perform a fast and crude feature selection by measuring the correlation between the outer product of each pair of features and the response, since I'll be using a random forest or boosting classifier.
The number of features is ~60,000 and the number of observations (the length of the response vector) is ~2,200,000.
Given unlimited memory, perhaps the fastest way to go about this would be to generate a matrix whose columns are the outer products of all pairs of features and call cor on that matrix against the response. As a smaller-dimension example:
set.seed(1)
feature.mat <- matrix(rnorm(2200*100), nrow=2200, ncol=100)
response.vec <- rnorm(2200)
# generate indices of all unique pairs of features and get their products
feature.pairs <- t(combn(1:ncol(feature.mat), 2))
feature.pairs.prod <- feature.mat[, feature.pairs[,1]] * feature.mat[, feature.pairs[,2]]
# compute the correlation of each product column with the response
res <- cor(feature.pairs.prod, response.vec)
But for my real dimensions feature.pairs.prod would be 2,200,000 by 1,799,970,000, which obviously cannot be stored in memory.
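For scale, a quick back-of-the-envelope in R (assuming standard 8-byte doubles) for the full product matrix:
n.obs   <- 2200000
n.pairs <- choose(60000, 2)    # 1,799,970,000 unique pairs
n.obs * n.pairs * 8 / 2^50     # roughly 28 pebibytes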
So my question is: is it possible to get all the correlations in a reasonable computation time, and if so, how?
I was thinking that breaking feature.pairs.prod down into chunks that fit in memory, and running cor between each chunk and response.vec one at a time, would probably be fastest, but I'm not sure how to automatically determine in R what dimensions these chunks should have.
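The kind of chunking I have in mind looks roughly like this (chunk.size is just an assumed memory budget here, not something derived automatically, which is exactly the part I don't know how to do):
chunk.size <- 5000                       # assumed; would need tuning to available RAM
n.pairs <- nrow(feature.pairs)
res <- numeric(n.pairs)
for (start in seq(1, n.pairs, by = chunk.size)) {
  idx <- start:min(start + chunk.size - 1, n.pairs)
  # product matrix for this chunk only: n.obs x length(idx)
  prod.chunk <- feature.mat[, feature.pairs[idx, 1], drop = FALSE] *
                feature.mat[, feature.pairs[idx, 2], drop = FALSE]
  res[idx] <- cor(prod.chunk, response.vec)
}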
Another option is to apply a function over feature.pairs that computes the product for each pair and then its correlation with response.vec.
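For example, something along these lines (although I suspect the overhead of calling cor once per pair will make this slow):
res <- apply(feature.pairs, 1, function(p) {
  # product of one pair of feature columns, correlated with the response
  cor(feature.mat[, p[1]] * feature.mat[, p[2]], response.vec)
})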
Any suggestions?