How to make my loop run faster in R?

Question

I'm using a function to get p-values from multiple HWE chi-square tests. To do this, I loop through a large matrix called geno.data (313 rows x 355232 columns), processing two columns at a time, row by row. It runs very slowly. How can I make it faster? Thanks

library(genetics)
geno.data<-matrix(c("a","c"), nrow=313,ncol=355232)
Num_of_SNPs<-ncol(geno.data) /2
alleles<- vector(length = nrow(geno.data))
HWE_pvalues<-vector(length = Num_of_SNPs)
j<- 1

for (count in 1:Num_of_SNPs){
    for (i in 1:nrow(geno.data)){
        alleles[i]<- levels(genotype(paste(geno.data[i,c(2*j -1, 2*j)], collapse = "/")))
    }
    g2 <- genotype(alleles)
    HWE_pvalues[count]<-HWE.chisq(g2)[3]
    j = j + 2
}

Please see http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example — csgillespie, Nov 24 '14 at 15:47
so you're doing `choose(355232, 2)` chisq tests? do you happen to know fortran? — rawr, Nov 24 '14 at 15:53
@rawr I don't know fortran. I'm using a function from an R package to do the chi square test. It's specific to my problem. — cooldood3490, Nov 24 '14 at 16:03

josliber · Accepted Answer · 2014-11-24T18:46:25.667

First, note that the posted code will fail with a "subscript out of bounds" error. Your j value increases by 2 on every iteration, so it reaches ncol(geno.data)-1 on the final iteration, while the loop accesses columns 2*j-1 and 2*j; 2*j already exceeds ncol(geno.data) about halfway through the loop. I'm assuming you instead want columns 2*count-1 and 2*count, in which case j can be removed.
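To make the indexing issue concrete, here is a minimal sketch on a toy matrix (geno.toy, n_snps, orig_cols, fixed_cols and bad_iter are illustrative names, not from the question). It only computes which columns each indexing scheme would request, assuming the loop is meant to visit each adjacent pair of columns exactly once:

```r
# Toy matrix: 3 samples, 4 SNPs (8 allele columns); values are arbitrary.
geno.toy <- matrix(c("a", "c"), nrow = 3, ncol = 8)
n_snps <- ncol(geno.toy) / 2

# Columns requested by the original indexing, where j = 1, 3, 5, 7.
j <- seq(1, by = 2, length.out = n_snps)
orig_cols <- cbind(first = 2 * j - 1, second = 2 * j)

# Columns requested when indexing by the loop counter instead.
count <- seq_len(n_snps)
fixed_cols <- cbind(first = 2 * count - 1, second = 2 * count)

# Iterations in which the original indexing asks for a nonexistent column.
bad_iter <- which(orig_cols[, "second"] > ncol(geno.toy))

print(orig_cols)
print(fixed_cols)
print(bad_iter)
```

With the counter-based indexing, every column from 1 to ncol(geno.toy) is visited exactly once and none is out of range.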

Vectorization is extremely important for writing fast R code. In your code you're calling the paste function 313 times, each time passing vectors of length 1. It's much faster in R to call paste once passing vectors of length 313. Here are the original and vectorized interiors of the main for loop:

# Original
get.pval1 <- function(count) {
  for (i in 1:nrow(geno.data)){
    alleles[i]<- levels(genotype(paste(geno.data[i,c(2*count -1, 2*count)], collapse = "/")))
  }
  g2 <- genotype(alleles)
  HWE.chisq(g2)[3]
}

# Vectorized
get.pval2 <- function(count) {
  g2 <- genotype(paste0(geno.data[,2*count-1], "/", geno.data[,2*count]))
  HWE.chisq(g2)[3]
}
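The difference between the two versions comes down to how paste is called. The following base R sketch isolates just that step on a made-up 3 x 2 matrix (geno.toy, by_row and vectorized are illustrative names); it leaves out the genotype and HWE.chisq calls, so it shows only that both approaches build the same "allele/allele" strings:

```r
# Toy data: 3 samples, one SNP stored as two allele columns.
geno.toy <- matrix(c("a", "c",
                     "c", "c",
                     "a", "a"), nrow = 3, ncol = 2, byrow = TRUE)

# Row by row: one paste() call per sample, as in the original inner loop.
by_row <- character(nrow(geno.toy))
for (i in seq_len(nrow(geno.toy))) {
  by_row[i] <- paste(geno.toy[i, c(1, 2)], collapse = "/")
}

# Vectorized: a single paste0() call over whole columns.
vectorized <- paste0(geno.toy[, 1], "/", geno.toy[, 2])

print(identical(by_row, vectorized))
```

The vectorized form does the same work in one call, which avoids the per-row function call overhead that dominates the original loop.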

We get about a 20x speedup from the vectorization:

library(microbenchmark)
all.equal(get.pval1(1), get.pval2(1))
# [1] TRUE
microbenchmark(get.pval1(1), get.pval2(1))
# Unit: milliseconds
#          expr       min        lq      mean    median        uq       max neval
#  get.pval1(1) 299.24079 304.37386 323.28321 307.78947 313.97311 482.32384   100
#  get.pval2(1)  14.23288  14.64717  15.80856  15.11013  16.38012  36.04724   100

With the vectorized code, the full run should take about 177616*.01580856 = 2807.853 seconds, or about 47 minutes (compared to about 16 hours for the original code). If this is still not fast enough for you, then I would encourage you to look at the parallel package in R. The mcmapply function should give a good speedup, since each iteration of the outer for loop is independent.
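As a rough sketch of what the parallel version could look like, the example below runs a per-SNP function over all SNP indices with mcmapply. It uses a small random toy matrix, and snp_stat is a hypothetical placeholder standing in for get.pval2 (it returns the fraction of heterozygous samples, not a p-value), so that the example runs without the genetics package. The choice of 2 cores is also an assumption; mcmapply relies on forking, which is not available on Windows, so the sketch falls back to a single core there:

```r
library(parallel)

# Toy stand-in for geno.data: 20 samples, 50 SNPs (100 allele columns).
set.seed(1)
geno.toy <- matrix(sample(c("a", "c"), 20 * 100, replace = TRUE), nrow = 20)
n_snps <- ncol(geno.toy) / 2

# Hypothetical placeholder for get.pval2(): fraction of heterozygous
# samples for one SNP. Swap in the real per-SNP function here.
snp_stat <- function(count) {
  mean(geno.toy[, 2 * count - 1] != geno.toy[, 2 * count])
}

# Forking is unavailable on Windows, so use one core there.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Serial reference result, then the same computation split across cores.
serial_res <- vapply(seq_len(n_snps), snp_stat, numeric(1))
parallel_res <- mcmapply(snp_stat, seq_len(n_snps), mc.cores = n_cores)

print(all.equal(serial_res, parallel_res))
```

In the real problem, the call would be along the lines of mcmapply(get.pval2, 1:Num_of_SNPs, mc.cores = ...), with the core count chosen to suit the machine.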
