I am using R and have a big dataset containing 12,224,433 rows. For every row I want to run a Spearman correlation test against one vector and extract the p-value. The script looks like this:

pvals <- numeric(nrow(SNP))

for (i in 1:nrow(SNP)) {
  fit <- cor.test(vector, as.numeric(SNP[i, 4:50]), method = 'spearman', exact = FALSE)
  pvals[i] <- fit$p.value
  names(pvals)[i] <- paste(SNP$V1[i], SNP$V2[i])
}

The thing is, it takes ages. From a rough calculation, it took 2 hours to run just the first 70,000 rows, so the full dataset could take around 200 hours. Is there any way to speed it up?

Yun Wang
3 Answers


Here's what I can suggest based on the info you have shared. I have added my thoughts as comments in the code:

# Convert the relevant columns to a numeric matrix once, instead of calling
# as.numeric(SNP[i, 4:50]) in every loop iteration.
# Subsetting a matrix row also gives you a vector directly, which is what cor.test() needs.
y <- as.matrix(SNP[, 4:50])

# Initialize pvals with NA and replace each value in every loop run
pvals <- rep(NA_real_, nrow(SNP))

for (i in 1:nrow(SNP)) {
  fit <- cor.test(vector, y[i, ], method = 'spearman', exact = FALSE)
  pvals[i] <- fit$p.value
}

# Assign all names in one go instead of doing it in the loop
names(pvals) <- paste(SNP$V1, SNP$V2)

Finally, yours is a classic use case for parallel processing. Using a parallel processing package like foreach, you can run multiple tests in parallel and then combine them into your result vector pvals, as sketched below.
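A minimal sketch of that idea with a doParallel backend (the worker count of 4 is an illustrative assumption, and for 12 million rows you would want to iterate over blocks of rows rather than single rows to amortize scheduling overhead):

library(foreach)
library(doParallel)

# Illustrative assumption: 4 workers; pick a count to match your machine
cl <- makeCluster(4)
registerDoParallel(cl)

# .combine = 'c' concatenates the per-row p-values into a single vector;
# foreach automatically exports the variables the body references (vector, y)
pvals <- foreach(i = 1:nrow(SNP), .combine = 'c') %dopar% {
  cor.test(vector, y[i, ], method = 'spearman', exact = FALSE)$p.value
}
stopCluster(cl)

names(pvals) <- paste(SNP$V1, SNP$V2)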

I also suggest reading the book 'The R Inferno' for more info on how to improve code efficiency.

Shree
  • you might be able to shave a little bit more by extracting only the necessary bits from `cor.test.default()` (but it would be a nuisance) – Ben Bolker Nov 20 '18 at 15:21
  • @BenBolker Thanks for the suggestion. Feel free to add to or edit the answer. I am unfamiliar with `cor.test()` or `cor.test.default()` so can't do it myself. Thanks! – Shree Nov 20 '18 at 15:33
  • 1
    I thought I have to use some parallel processing packages. But by moving the name() out of the loop alone save already enormous time! – Yun Wang Nov 22 '18 at 09:02
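To sketch Ben Bolker's suggestion above: with method = 'spearman' and exact = FALSE, cor.test() uses the asymptotic t approximation, so the p-values can be computed in a vectorised way from ranks, skipping cor.test() entirely. A minimal sketch, assuming y is the numeric matrix from the answer above and there are no NAs (note that the ranked matrix occupies several GB at 12 million rows, so you may need to process it in chunks):

n <- length(vector)        # 47 observations per test
rank_v <- rank(vector)     # rank the fixed vector once

# rank each row of y; apply() returns the ranks as columns (47 x nrow(SNP))
rank_y <- apply(y, 1, rank)

# Spearman's rho is the Pearson correlation of the ranks;
# cor(matrix, vector) correlates every column with the vector in one call
rho <- as.vector(cor(rank_y, rank_v))

# asymptotic t approximation, matching cor.test(..., exact = FALSE)
tstat <- rho * sqrt((n - 2) / (1 - rho^2))
pvals <- 2 * pt(abs(tstat), df = n - 2, lower.tail = FALSE)
names(pvals) <- paste(SNP$V1, SNP$V2)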

You can use apply:

SNP["pvals"] <- apply(SNP[ ,c(4:50)], MARGIN = 1, FUN = function(row) cor.test(vector, as.numeric(row), method='spearman', exact=FALSE)$p.value)

#SNP$pvals
emsinko
  • Turns out `apply` behaves similarly to `for` in that it runs calls at the R level (not the C level like the other members of the apply family). See this [canonical question and answer](https://stackoverflow.com/a/29006276/1422451) from @DavidArenburg, who answered his own question. – Parfait Nov 20 '18 at 15:12

This would be a good candidate for using parallel processing with a package such as foreach or future.apply.

The code below makes use of future.apply because of how simple that package is to use.

The general strategy is to take the action you want to repeat (i.e. getting p-values based on a subset of data), turn that action into a function, and use future.apply to repeat that function for the different subsets of data you want to use.

library(future.apply)

# Establish the method used for parallel processing
plan(multiprocess)

# Convert the relevant subset of the matrix to numeric
snp_subset <- SNP[, 4:50]
class(snp_subset) <- 'numeric'

# Define a function that returns the named p-value for a given row of the matrix
get_pvals <- function(row_index) {
  pvals <- cor.test(vector, snp_subset[row_index, ],
                    method = 'spearman', exact = FALSE)$p.value
  names(pvals) <- paste(SNP$V1[row_index], SNP$V2[row_index])
  pvals
}

# Use parallel processing to get a p-value for each row of the matrix
pvals <- future_sapply(X = seq_len(nrow(SNP)), FUN = get_pvals)
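Note that in recent versions of the future package, plan(multiprocess) has been deprecated in favour of the explicit plan(multisession) (background R sessions, all platforms) or plan(multicore) (forked processes, not available on Windows).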
bschneidr