0

I have two matrices diff.exp.protein.expr and lncRNA.expr each of which have 64 columns but different number of rows.

>dim(diff.exp.protein.expr)
[1] 14000 64

>dim(lncRNA.expr)
[1] 7600 64

I am using these matrices as input in two separate linear models where I compare each row of diff.exp.protein.expr with all rows of lncRNA.expr (=2*106 million tests). Eventually, I want to compare whether the two linear models are statistically different using anova for which I have written a function which is as follows:

lm.anova <- function(diff.exp.protein.expr,lncRNA.expr,colData) 
{
    tempdff <- data.frame() #create an empty dataframe
    for(i in 1:nrow(diff.exp.protein.expr)) #traverse through 1st matrix
    {
        for(j in 1:nrow(lncRNA.expr)) #traverse through 2nd matrix
        {
            #model 1
            lm1 <- lm(diff.exp.protein.expr[i,]~colData$gender+colData$age+colData$treatment)
            #model 2 (has an additional interaction term at the end)
            lm2 <- lm(diff.exp.protein.expr[i,]~colData$gender+colData$age+colData$treatment+lncRNA.expr[j,]+colData$treatment*lncRNA.expr[j,])
            #get the summary of model 2
            lm.model <- summary(lm2)
            #get the rownames of ith row of matrix 1 and jth row of matrix 2
            #get the pvalue from the anova model
            #get the estimate and std. error for last two terms of model 2
            #add these values as a row to tempdff
            tempdff <- rbind(tempdff,data.frame(rownames(diff.exp.protein.expr)[i],rownames(lncRNA.expr)[j],
                                      anova(lm1,lm2)$"Pr(>F)"[2]),lm.model$coefficients[4,1:2],lm.model$coefficients[6,1:2])
        }
    }
    return(tempdff) #return results
}

lm.anova.res <- lm.anova(diff.exp.protein.expr,lncRNA.expr,colData) #call function

And this is how the data, that goes into the function, looks like:

>head(diff.exp.protein.expr)[,1:5] #log transformed values

                     C00060   C00079   C00135   C00150   C00154
ENSG00000000005.5  5.187977 6.323024 6.022986 5.376513 4.810042
ENSG00000000460.12 7.071433 7.302448 7.499133 7.441582 7.439453
ENSG00000000938.8  8.713285 8.584996 8.982816 9.787420 8.823927
ENSG00000001497.12 9.569952 9.658418 9.392670 9.394250 9.087214
ENSG00000001617.7  9.215362 9.417367 8.949943 8.172713 9.198314
ENSG00000001626.10 6.363638 6.038328 6.698733 5.202132 5.336239

>head(lncRNA.expr)[,1:5] #log transformed values
                     C00060   C00079   C00135   C00150   C00154
ENSG00000100181.17 7.477983 7.776463 7.950514 8.098205 7.278615
ENSG00000115934.11 4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000122043.6  4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000124915.6  4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000125514.5  4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000129816.5  4.104380 4.104380 4.104380 4.104380 4.104380

>head(colData)[,1:5] #sample information for each column of the two input matrices
sample_name status age gender      site treatment
     C00060     NF  48 Female Cleveland      case
     C00079     NF  26 Female Cleveland      case
     C00135     NF  55 Female Cleveland      case
     C00150     NF  61   Male Cleveland      ctrl
     C00154     NF  50   Male Cleveland      case
     C00176     NF  60 Female Cleveland      ctrl

I wrote this code when I had very few tests (~50) to be performed. Now, I have it all figured out except that I don't know how can I make my code efficient because using a for-loop in this case where 14000*7600 tests have to be performed doesn't make any sense. It would take forever to run. I would like to know what are other alternatives by which I can really speed up this code. Any suggestion would be appreciated.

UPDATE 1: I have updated my linear models a bit. Now, anova(lm1,lm2)"Pr(>F)" does not give the same value as that for the interaction term in model2.

UPDATE 2: I have updated my linear models so that lm1 is nested in lm2.

Thanks,

Komal Rathi
  • 4,164
  • 13
  • 60
  • 98
  • Try using [data.table](http://cran.r-project.org/web/packages/data.table/index.html) instead of `data.frame`. See notes [here](http://stackoverflow.com/questions/22588673/linear-regression-using-data-table-in-r) for tips on how to do this. – Lenwood Apr 24 '14 at 01:25
  • Thanks @Lenwood I am not very good at data.tables but I will give it a try. – Komal Rathi Apr 24 '14 at 15:23

3 Answers3

0

Your problem seems to be the solved by my package MatrixEQTL.

http://www.bios.unc.edu/research/genomic_software/Matrix_eQTL/

http://cran.r-project.org/web/packages/MatrixEQTL/index.html

You would need to use the modelLINEAR_CROSS model to test for the significance of the interaction term.

Please feel free to ask any questions about the package.

Andrey Shabalin
  • 4,389
  • 1
  • 19
  • 18
  • Thanks @Andrey Shabalin for the suggestion. But I think my data is really different than what is given in the manual for MatrixEQTL. Or maybe it is too advanced for me to understand. – Komal Rathi Apr 24 '14 at 15:40
  • I would like to warn you that it is not statistically correct to do ANOVA test for non-nested models. – Andrey Shabalin Apr 24 '14 at 16:24
  • But all the variables of lm1 are present in lm2. So how are they non-nested? – Komal Rathi Apr 24 '14 at 16:27
  • Variable `colData$treatment` is not nested in `lncRNA.expr[j,]*colData$treatment`. They are essentially different variables in the respective linear models. – Andrey Shabalin Apr 24 '14 at 16:29
  • Isn't lncRNA.expr[j,]*colData$treatment equal to lncRNA.expr[j,]+colData$treatment+lncRNA.expr[j,]:colData$treatment? – Komal Rathi Apr 24 '14 at 16:41
  • According to your suggestion, I have changed my linear model such that lm1 is nested within lm2. – Komal Rathi Apr 24 '14 at 16:45
  • Quite the opposite, `lncRNA.expr[j,]*colData$treatment` is equal to `lncRNA.expr[j,]+colData$treatment+lncRNA.expr[j,]*colData$treatment`. This is motivated by the agreement that the interaction terms are included only when the individual terms are also in the model. – Andrey Shabalin Apr 24 '14 at 16:50
  • After the last update, are you sure you want to test to joint contribution to the model of `lncRNA.expr[j,]` and `colData$treatment*lncRNA.expr[j,]`? Or would your scientific question be better answered by testing only the interaction term? – Andrey Shabalin Apr 24 '14 at 16:56
0

So there are a couple of things.

First, calling data.frame(...) is very expensive, so calling it at each iteration will slow things down considerably. Lists, on the other hand, are extremely efficient. So try to keep everything in lists until the end.

Second, it might just be that running 2 * 106 million regressions is taking most of the time.

I would be inclined to try it this way:

## not tested...
df1 <- t(diff.exp.protein.expr)
df2 <- t(lncRNA.expr)

df  <- cbind(df1,df2,colData)
names <- expand.grid(x=colnames(df1),y=colnames(df2),stringsAsFactors=FALSE)
get.anova <- function(n){
  form.1 <- as.formula(paste0(n[1],"~gender+age+treatment+",n[2]))
  form.2 <- as.formula(paste0(n[1],"~gender+age+treatment+",n[2],"+treatment:",n[2]))
  lm1    <- lm(form.1,data=df)
  lm2    <- lm(form.2,data=df)
  coef <- summary(lm2)$coefficients
  list(n[1],n[2],anova(lm1,lm2)$"Pr(>F)"[2],coef[5,1],coef[5,2],coef[6,1],coef[6,2])
}
result <- do.call(rbind,apply(names,1,get.anova))
result <- data.frame(result)
colnames(result) <- c("protein","lncRNA","p.value","est.1","se.1","est.2","se.2")

This was not tested because the dataset you provided was not big enough: the models have < 0 df for error. Consequently, there is no row 6 in the coefficients table. Your real dataset will not have that problem.

EDIT (Response to OP's comment and provision of data).

With the dataset provided in your comment, we can benchmark both the original code (based on your updated post), and the new version (based on the code above, updated to reflect your new model formulas). In both case I use just 10 rows from each dataset (100 combinations).

f.original <- function() lm.anova(diff.exp.protein.expr.sub[1:10,],lncRNA.expr.sub[1:10,],colData)
f.new      <- function() do.call(rbind,apply(names,1,get.anova))
library(microbenchmark)
microbenchmark(f.original(), f.new(), times=10)
# Unit: seconds
#          expr      min       lq   median       uq      max neval
#  f.original() 2.622461 2.712527 2.824419 2.869193 2.914099    10
#       f.new() 2.049756 2.074909 2.144422 2.191546 2.224831    10

So we can see that the new version, which returns lists instead of data frames, is about 25% faster.

Profiling both approaches, using Rprof, shows that the original version spends about 50% of it's time in lm(...), whereas the new version spends about 60% of it's (shorter) time in lm(...). This makes sense and suggests that the best you could possibly do is an improvement of about 30% over the new version. In other words: the calls to lm(...) are the bottlenecks: 200 million calls to lm(...) just takes a long time.

You could consider a parallel computing approach, for example using the foreach package, but before going down that road IMO you should consider other strategies for getting to your end result.

jlhoward
  • 58,004
  • 7
  • 97
  • 140
  • Hi @jlhoward, thanks for the reply. I tried your code but I get an error during the function calling step. Error in FUN(newX[, i], ...) : object 'x' not found – Komal Rathi Apr 24 '14 at 14:40
  • I fixed the code, it was something to do with list(n[1],n[2]...) instead of list(x,y...). So your code runs fine but is there a way I can really speed up the code? @jlhoward – Komal Rathi Apr 24 '14 at 15:16
  • Yes, there was an error. I edited the code to reflect your comment. So I take it this does *not* speed up the process? Can you provide a useable sample, say the first 100 rows ( & *all 64 columns*) of `diff.exp.protein.expr` and `lncRNA.expr` and all of `colData`? You could upload these somewhere and post a link in your question. – jlhoward Apr 24 '14 at 15:53
  • I can upload it on the dropbox and share it with you. Will that be ok? – Komal Rathi Apr 24 '14 at 15:56
  • So, no matter what I do, it is going to take a long time? I will try figuring it out using data.table/for-each/ddply and other stuff. Thanks for all that help. – Komal Rathi Apr 24 '14 at 18:29
0

The main time consumer here is not the nested loop but lm. And here are a couple of things you can optimize (but see also Andrey Shabalin's answer - he may have an easier solution for your case).

You have two lm's, in simplified form:

        lm1 <- lm(Y~ X1 + X2 + X3 + X4)
        lm2 <- lm(Y~ X1 + X2 + X3 + X4 + X3:X4)

Then you make a summary of lm2 and compare lm1 with anova. Because you extract just the p-value from anova, but the p-value is identical to the p-value for interaction term in lm2, then doing anova is redundant. And lm1 is consequently also redundant. Then using rbind to increment a data frame is a waste of time, so I'd suggest using a list and just add a new element at each iteration. so your code within the loop could be simplified to:

# lm1 <- ... # skip this 
#model 2 (has an additional interaction term at the end)
lm2 <- lm(diff.exp.protein.expr[i,]~colData$gender+colData$age+colData$treatment+lncRNA.expr[j,]+lncRNA.expr[j,]:colData$treatment)
#get the summary of model 2, * and coefficients from that summary *
lms <- coef(summary(lm2))
tempdff[[length(tempdff)+1]]  <- data.frame(rownames(diff.exp.protein.expr)[i],
     rownames(lncRNA.expr)[j], lms[6,4],lms[5,1:2],lms[6,1:2]) 

In addition, you have the result as a data.frame - using a list instead would save some time too.

A next step could be examining what the lm and summary.lm are doing. You don't need all of what it does, you just use b's and their standard errors, as well as the p-value from the last row. You might be able to skip some of the computations that lm and summary.lm do .. with the help of perhaps some matrix algebra.

lebatsnok
  • 6,329
  • 2
  • 21
  • 22
  • Wait! I actually want to extract out the p-value that anova returns, the p-value that tells you whether two models are same or different. I just realized what I have been extracting is the same p-value as the one for the interaction term in Model2. How do I extract the p-value that anova returns? – Komal Rathi Apr 24 '14 at 13:54
  • I have updated my models. Now the p-value returned by anova is not identical to the p-value for interaction term in lm2. @lebatsnok – Komal Rathi Apr 24 '14 at 14:17
  • The p-value for anova is not equal to the p-value for interaction term exactly because the models are not nested. Otherwise they would be. – Andrey Shabalin Apr 24 '14 at 16:38
  • So now that I have updated my model accoring to your suggestion, the p-value of anova(lm1,lm2)"Pr(>F)"[2] will be same as the interaction term for lm2? Then what is the point of doing an anova?? Doesn't it give you the p-value of whether or not your models are statistically different or not? – Komal Rathi Apr 24 '14 at 16:55
  • The point of anova is to test the contribution of several variables at once. Testing for significance of a single variable with anova is clearly not the most efficient approach. I suggest focusing on the actual research question and the hypothesis being tested. Whether it would be anova or a simple significance-of-a-variable test that is best for you would follow from the statistical setup. – Andrey Shabalin Apr 24 '14 at 16:59
  • If you have nested models that differ just by the interaction term (as in your example above) then the p-value of the interaction term is *always* *exactly* *identical* to the p-value from `anova` comparing the two models. If you have something different then you probably should rephrase your question. (You can't compare non-nested models with `anova`. But the general idea is to test the addition of several terms at once, not just one as Andrey Shabalin commented above.) – lebatsnok Apr 24 '14 at 18:43