
I come from a Java/Python comp-sci theory background, so I am still getting used to the various R packages and how they can save run time in functions.

Basically, I am working on a few projects, and all of them involve taking individual factors from a long data set (15,000 to 200,000 factors), performing calculations against individual factors in an equally large data set, and storing the results of those calculations in a data frame whose length is the product of the two.

So far I have been using nested while loops and concatenating into a growing list, but that is taking days. I've recently learned about 'lapply' and the 'data.table' package in R, and I would love to see an example of how to apply (no pun intended) them to the following basic correlation function:

Corr <- function(miRdf, mRNAdf)
{
  j <- 1   # mRNAdf row index
  k <- 1   # miRdf row index
  m <- 1   # inner-loop counter
  n <- 1   # outer-loop counter
  c <- 0   # progress counter
  corrList <- NULL
  while (n <= 71521)
  {
    while (m <= 1477)
    {
      corr <- cor(as.numeric(miRdf[k, 2:13]), as.numeric(mRNAdf[j, 2:13]),
                  use = "complete.obs")
      corrList <- c(corrList, corr)   # grows the list on every pass
      j <- j + 1
      c <- c + 1
      print(c)   # just a counter to see how far the function has run
      m <- m + 1
    }
    k <- k + 1
    n <- n + 1
    j <- 1
    m <- 1   # reset the inner while loop
  }
  corrList <- matrix(unlist(corrList), ncol = 1477, byrow = FALSE)
  colnames(corrList) <- miRdf[, 1]
  rownames(corrList) <- mRNAdf[, 1]
  write.csv(corrList, "testCorrWhole.csv")
}

As you can see, the nested while loops produce 105,636,517 (71,521 x 1,477) miRNA-vs-mRNA expression-value correlation scores that need to be computed and stored in a data frame of 1,477 columns x 71,521 rows in order to generate the scoring matrix.

My question is: can anyone shed light on how to turn the above monstrosity into an efficient function that uses 'lapply' instead of the while loops, and uses the 'data.table' set() function to do away with the inefficiency of concatenating to a list on every pass through the loop?
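For what it's worth, here is a rough, untested sketch of the direction I was imagining, just to make concrete the shape of answer I am after (the 2:13 columns and the orientation are taken from my code above):

# Rough, untested sketch of what I had in mind -- preallocate the full
# matrix up front instead of growing corrList, then fill one column per
# miRNA, with lapply() replacing the inner while loop.
corrMat <- matrix(NA_real_, nrow = nrow(mRNAdf), ncol = nrow(miRdf))
for (k in seq_len(nrow(miRdf))) {
  corrMat[, k] <- unlist(lapply(seq_len(nrow(mRNAdf)), function(j)
    cor(as.numeric(miRdf[k, 2:13]), as.numeric(mRNAdf[j, 2:13]),
        use = "complete.obs")))
}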

Thank you in advance!

tomathon
  • Could you add some sample datasets please? This will help people work on this question more easily. – Steph Locke Dec 04 '13 at 09:32
  • How would me adding these enormous datasets in here help anyone? They are just data frames with 8 columns each and 1477 and 71,521 rows respectively. Each row of one df is compared to each row of the second. I am just trying to figure out how to "re-write" the above code using 'lapply' and 'data.table'. Anyone who is familiar with those functions will easily be able to comprehend what my original code does. – tomathon Dec 04 '13 at 09:43
  • Let's assume you googled and you've already read the articles out there that could help, e.g. http://www.jstatsoft.org/v40/i01/ . If you already have, then you're here for specific advice - this is where a reproducible set of code comes in handy, as it makes it easier for people to get up and running solving your problem. The harder you make it, the less likely you are to get an answer. As for how to do it - you don't need a full data.frame, just a subset using some rnorm() vectors. http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – Steph Locke Dec 04 '13 at 09:57
  • What is this "googled" you speak of? Is that an R package too? Look man, I've read the lit that I could find: http://www.burns-stat.com/pages/Tutor/R_inferno.pdf, http://cran.r-project.org/web/packages/data.table/index.html, http://cran.r-project.org/web/packages/data.table/data.table.pdf, etc. I am just looking for some tips as to how to turn an already-working function that I wrote into a more efficient and time-effective one. Thanks for the links. I will read them just as I have read the others. I was just looking for a more direct approach to my problem. – tomathon Dec 04 '13 at 10:05
  • @TomA if you just `dput` the first 50 rows then people could exactly interpret what your code is doing...have a look around SO [r] at the Q with lots of answers and those with none. – Stephen Henderson Dec 04 '13 at 10:12

2 Answers


Your names end with 'df', which makes it seem like your data are a data.frame. But @Troy's answer uses a matrix. A matrix is appropriate when the data are homogeneous, and matrix operations are generally much faster than data.frame operations. So you can see already that if you'd provided a small example of your data set (e.g., dput(mRNAdf[1:10,])), people might be in a better position to help you; this is what they're asking for.

In large numerical calculations it makes sense to 'hoist' any repeated calculations outside the loop, so they are performed only once. Repeated calculations in your case include subsetting to columns 2:13 and coercion to numeric. With this idea, and guessing that you actually have a data.frame where each column is already a numeric vector, I'd start with

mRNAmatrix <- as.matrix(mRNAdf[,2:13])
miRmatrix <- as.matrix(miRdf[,2:13])

From the help page ?cor we see that the arguments can be a matrix, and if so the correlation is calculated between columns. You're interested in the result when the arguments are transposed relative to your current representation. So

result <- cor(t(mRNAmatrix), t(miRmatrix), use="complete.obs")

This is fast enough for your purposes:

> m1 = matrix(rnorm(71521 * 12), 71521)
> m2 = matrix(rnorm(1477 * 12), 1477)
> system.time(ans <- cor(t(m1), t(m2)))
   user  system elapsed 
  9.124   0.200   9.340 
> dim(ans)
[1] 71521  1477

`result` is the same as your corrList -- it's not a list but a matrix, and the row and column names have probably been carried forward. You'd write it to a file as you do above: write.csv(result, "testCorrWhole.csv")
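If the dimnames don't come through automatically, you can set them yourself before writing the file; a minimal sketch, assuming (as in your function) that column 1 of each data.frame holds the identifiers:

rownames(result) <- mRNAdf[, 1]   # rows of result correspond to rows of mRNAdf
colnames(result) <- miRdf[, 1]    # columns of result correspond to rows of miRdf
write.csv(result, "testCorrWhole.csv")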

Martin Morgan
  • Thanks! So in the code above, where is the script that outputs the individual cor scores into an "output" data frame? Sorry if I am not understanding fully. – tomathon Dec 04 '13 at 18:59
  • I've updated my reply. The result is in a matrix, not a data.frame (in your code too, I think) and can be written to a file with `write.csv`. – Martin Morgan Dec 04 '13 at 19:46
  • That code went from taking 14 hrs (the way I had it written) to 45 seconds your way. I can't thank you enough! – tomathon Dec 04 '13 at 21:44

UPDATED BELOW TO SHOW PARALLEL PROCESSING - ABOUT A 60% SAVING

Using apply() might not be quick enough for you, but here's how to do it. I'll have a think about performance, since this example (1M output correlations in a 1000x1000 grid) takes over a minute on a laptop.

miRdf <- matrix(rnorm(13000, 10, 1), ncol = 13)
mRNAdf <- matrix(rnorm(13000, 10, 1), ncol = 13)
miRdf[, 1] <- 1:nrow(miRdf)     # using column 1 as indices since they're not in the calc.
mRNAdf[, 1] <- 1:nrow(mRNAdf)

corRow <- function(y) {
  apply(miRdf, 1, function(x) cor(as.numeric(x[2:13]), as.numeric(mRNAdf[y, 2:13]),
                                  use = "complete.obs"))
}

system.time(apply(mRNAdf, 1, function(x) corRow(x[1])))
#    user  system elapsed 
#   72.94    0.00   73.39

And with parallel::parApply on a 4-core Win64 laptop:

require(parallel)   # library that allows parallel processing

miRdf <- matrix(rnorm(13000, 10, 1), ncol = 13)
mRNAdf <- matrix(rnorm(13000, 10, 1), ncol = 13)
miRdf[, 1] <- 1:nrow(miRdf)     # using column 1 as indices since they're not in the calc.
mRNAdf[, 1] <- 1:nrow(mRNAdf)

corRow <- function(y) {
  apply(miRdf, 1, function(x) cor(as.numeric(x[2:13]), as.numeric(mRNAdf[y, 2:13]),
                                  use = "complete.obs"))
}

# make a cluster from all available cores
cl <- makeCluster(detectCores())

# use clusterExport() to distribute the function and data needed in the apply() call
clusterExport(cl, c("corRow", "miRdf", "mRNAdf"))

# time the call
system.time(parApply(cl, mRNAdf, 1, function(x) corRow(x[[1]])))

# stop the cluster
stopCluster(cl)

# time the call without clustering
system.time(apply(mRNAdf, 1, function(x) corRow(x[[1]])))

## WITH CLUSTER (4 cores)
#    user  system elapsed 
#    0.04    0.03   29.94 

## WITHOUT CLUSTER
#    user  system elapsed 
#   73.96    0.00   74.46
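
One note on where the output lives: system.time() only reports the timing and discards the expression's value, so to actually keep the correlation scores, assign the parApply() call to a name while the cluster is still up (a minimal sketch; `scores` is just an illustrative name):

# run this before stopCluster(cl), so the workers are still available
scores <- parApply(cl, mRNAdf, 1, function(x) corRow(x[[1]]))
# apply()-style calls return one column per row of mRNAdf here, so
# transpose if you want rows = mRNA and columns = miRNA as in the question
scores <- t(scores)
dim(scores)   # 1000 x 1000 in this example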
Troy
  • Thanks man. I will definitely give it a shot. I really appreciate you actually offering a real answer. I'm still looking for a way to utilize data.table too, but either way, I really appreciate some actual advice. I was beginning to think that giving reasons not to offer advice (as opposed to actually offering some) was standard practice for this forum. Thank you for providing evidence to the contrary. – tomathon Dec 04 '13 at 10:34
  • @TomA - thinking about it I think a lot of the time is in the actual 1M correlation calcs. I've edited the example to show how you can process in parallel using parApply(). On my laptop I was able to get about a 60% performance improvement over apply() – Troy Dec 04 '13 at 11:49
  • I am looking over your code right now. I am having trouble understanding at what point you are creating the output data frame that contains the individual correlation scores. Thanks again! – tomathon Dec 04 '13 at 17:52