I am using a foreach loop to calculate correlation coefficients and p-values, with mtcars as an example (foreach is overkill here, but the data frame I'm actually using has 450 observations of 3400 variables). I use combn to get rid of duplicate correlations and self-correlations.
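# stringsAsFactors = FALSE keeps the names as character in R < 4.0 as well
combo_cars <- data.frame(t(combn(names(mtcars), 2)), stringsAsFactors = FALSE)
library(foreach)
# Note: %dopar% runs sequentially (with a warning) unless a parallel backend is registered
cars_res <- foreach(i = 1:nrow(combo_cars), .combine = rbind, .packages = c("magrittr", "dplyr")) %dopar% {
  # one tidy row per pair, labelled with the two variable names
  broom::tidy(cor.test(mtcars[, combo_cars[i, 1]],
                       mtcars[, combo_cars[i, 2]],
                       method = "spearman")) %>%
    mutate(Var1 = combo_cars[i, 1], Var2 = combo_cars[i, 2])
}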
I would like to convert this into a function so that I can try the future package: I need to run correlations on subsections of the original data frame, and it is more efficient to run them in parallel. When trying to devise a function that replicates the above, I can use:
car_res2 <- data.frame(t(combn(names(mtcars), 2, function(x)
cor.test(mtcars[[x[1]]],
mtcars[[x[2]]], method="spearman"), simplify=TRUE)))
Ultimately I would like to be able to have four futures running in parallel, each computing the above on a different fraction of the dataset.
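To make the goal concrete, here is a rough sketch of the pattern I have in mind. It rests on several assumptions: the "fractions" are taken to be four chunks of variable pairs (row or column subsets would follow the same pattern), `spearman_chunk` is a hypothetical helper written only for this illustration, `exact = FALSE` is added just to silence the ties warning, and the machine is assumed to be able to start four background R sessions.

```r
library(future)

# Assumption: four background R sessions can be started on this machine
plan(multisession, workers = 4)

# Hypothetical helper: Spearman estimate and p-value for each row of `prs`,
# where `prs` is a two-column character matrix of variable names
spearman_chunk <- function(data, prs) {
  res <- apply(prs, 1, function(p) {
    ht <- cor.test(data[[p[1]]], data[[p[2]]], method = "spearman", exact = FALSE)
    c(estimate = unname(ht$estimate), p.value = ht$p.value)
  })
  data.frame(Var1 = prs[, 1], Var2 = prs[, 2], t(res), stringsAsFactors = FALSE)
}

var_pairs <- t(combn(names(mtcars), 2))

# Assumption: split the pair indices into four roughly equal chunks
n_pairs <- nrow(var_pairs)
chunks <- split(seq_len(n_pairs), cut(seq_len(n_pairs), breaks = 4, labels = FALSE))

# Launch one future per chunk, then collect and stack the results
futs <- lapply(chunks, function(idx) {
  future(spearman_chunk(mtcars, var_pairs[idx, , drop = FALSE]))
})
car_res_par <- do.call(rbind, lapply(futs, value))

plan(sequential)  # release the workers
```

All inputs are passed to the helper as arguments so that each future only has to export the helper, the data and its own chunk of pairs.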
However, the car_res2 output has 8 columns instead of 7, and the second one is completely empty. I had to use the output from cars_res to work out what the values were: they are in the order statistic, blank, p-value, estimate, etc., whereas cars_res has labelled columns (estimate, statistic, p-value).
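One way to check where the extra column comes from is to look at the raw object that cor.test returns for a single pair, before broom::tidy touches it. This is only a small diagnostic sketch; the pair mpg/cyl is an arbitrary choice and `exact = FALSE` is added just to silence the ties warning.

```r
# Inspect the untidied "htest" object for one pair of variables
ht <- cor.test(mtcars$mpg, mtcars$cyl, method = "spearman", exact = FALSE)

class(ht)            # the object is a list with class "htest"
names(ht)            # component names, in the order combn() flattens them
is.null(ht$parameter)  # whether the second component carries any value
```

If the component names line up with the unlabelled columns of car_res2, the difference from cars_res would be down to broom::tidy selecting, reordering and naming the components rather than to combn itself.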
- Why is the output of the second approach in a different order and unlabelled?
- Can I use one of the apply functions in place of the function above?
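For the second point, something along these lines is what I am imagining, as a minimal base R sketch rather than a finished solution. The helper name `cor_pairs` and the result name `car_res3` are made up for illustration, the columns are picked by name from the test object instead of going through broom::tidy, and `exact = FALSE` is added only to silence the ties warning.

```r
# Hypothetical helper: one labelled row per pair of variables.
# `var_pairs` is a two-column character matrix of column names.
cor_pairs <- function(data, var_pairs) {
  rows <- lapply(seq_len(nrow(var_pairs)), function(i) {
    v1 <- var_pairs[i, 1]
    v2 <- var_pairs[i, 2]
    ht <- cor.test(data[[v1]], data[[v2]], method = "spearman", exact = FALSE)
    # Extract components by name so the column order is explicit
    data.frame(Var1 = v1, Var2 = v2,
               estimate = unname(ht$estimate),
               statistic = unname(ht$statistic),
               p.value = ht$p.value,
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

var_pairs <- t(combn(names(mtcars), 2))  # unique pairs, no self-correlations
car_res3 <- cor_pairs(mtcars, var_pairs)
head(car_res3)
```

Because the data and the pairs are both arguments, the same function could be called on any subsection of the data frame.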
Any comments would be appreciated.