7

I am using R to analyze genome-wide association study data. I have about 500,000 potential predictor variables (single-nucleotide polymorphisms, or SNPs) and want to test the association between each of them and a continuous outcome (in this case low-density lipoprotein concentration in the blood).

I have already written a script that does this without problem. To briefly explain, I have a data object, called "Data". Each row corresponds to a particular patient in the study. There are columns for age, gender, body mass index (BMI), and blood LDL concentration. There are also half a million other columns with the SNP data.

I am currently using a for loop to run the linear model half a million times, as shown:

# Repeat loop half a million times
for(i in 1:500000) {

# Select the appropriate SNP
SNP <- Data[i]

# For each iteration, perform linear regression adjusted for age, gender, and BMI and save the result in an object called "GenoMod"
GenoMod  <- lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data)

# For each model, save the p value and error for each SNP. I save these two data points in columns 1 and 2 of a matrix called "results"
results[i,1] <- summary(GenoMod)$coefficients["Geno","Pr(>|t|)"]
results[i,2] <- summary(GenoMod)$coefficients["Geno","Estimate"]
}

All of that works fine. However, I would really like to speed up my analysis. I've therefore been experimenting with the multicore, DoMC, and foreach packages.

My question is, could someone please help me adapt this code using the foreach scheme?

I am running the script on a Linux server that apparently has 16 cores available. I've tried experimenting with the foreach package, and my results using it have been comparatively worse, meaning that it takes longer to run the analysis using foreach.

For example, I've tried saving the linear model objects as shown:

library(doMC)
registerDoMC()
results <- foreach(i=1:500000) %dopar% { lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data) }

This takes more than twice as long as using just a regular for loop. Any advice on how to do this better or more quickly would be appreciated! I understand that using the parallel version of lapply might be an option, but don't know how to do this either.

All the best,

Alex

Alexander
  • 977
  • 1
  • 13
  • 29
  • Update to R 2.14 and use the `parallel` package. And while we're at it, giving us a reproducible example to work with, will surely help as well. See [this question](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – Joris Meys Dec 13 '11 at 11:40
  • Joris, thanks for the advice. I have found the documentation for `parallel` at http://www.biomedcentral.com/content/pdf/1471-2105-9-390.pdf and will read it now. – Alexander Dec 13 '11 at 11:53
  • Is package `snowfall` off the table (don't hit me, Dirk)? – Roman Luštrik Dec 13 '11 at 13:13
  • @RomanLuštrik *SMASH* I'll do it :-). `snowfall` is supposed to work with the `parallel` package, but on Linux you better `multicore`. Works even easier than `snowfall` afaik. – Joris Meys Dec 13 '11 at 14:16

1 Answers1

8

To give you a startup: If you use Linux, you can do the multicore approach contained within the parallel package. Whereas you needed to set up the whole thing when using eg the foreach package, that's not necessary any more with this approach. Your code would be run on 16 cores by simply doing :

require(parallel)

mylm <- function(i){
  SNP <- Data[i]
  GenoMod  <- lm(bloodLDLlevel ~ SNP + Age + Gender + BMI, data = Data)
  #return the vector
  c(summary(GenoMod)$coefficients["Geno","Pr(>|t|)"],
    summary(GenoMod)$coefficients["Geno","Estimate"])
}

Out <- mclapply(1:500000, mylm,mc.cores=16) # returns list
Result <- do.call(rbind,Out) # make list a matrix

Here you make a function that returns a vector with the wanted quantities, and apply the indices over this. I couldn't check this though as I don't have access to the data, but it should work.

Joris Meys
  • 106,551
  • 31
  • 221
  • 263
  • Joris, thanks for the help! I've implemented your solution and it has seemingly worked. I just ran a job that previously took over 12 hours and it has come out of the oven in just 15 minutes! Now I just wish that I had asked you this question three months ago! – Alexander Dec 13 '11 at 14:15