I have two matrices diff.exp.protein.expr and lncRNA.expr each of which have 64 columns but different number of rows.
>dim(diff.exp.protein.expr)
[1] 14000 64
>dim(lncRNA.expr)
[1] 7600 64
I am using these matrices as input in two separate linear models where I compare each row of diff.exp.protein.expr with all rows of lncRNA.expr (=2*106 million tests). Eventually, I want to compare whether the two linear models are statistically different using anova for which I have written a function which is as follows:
lm.anova <- function(diff.exp.protein.expr,lncRNA.expr,colData)
{
tempdff <- data.frame() #create an empty dataframe
for(i in 1:nrow(diff.exp.protein.expr)) #traverse through 1st matrix
{
for(j in 1:nrow(lncRNA.expr)) #traverse through 2nd matrix
{
#model 1
lm1 <- lm(diff.exp.protein.expr[i,]~colData$gender+colData$age+colData$treatment)
#model 2 (has an additional interaction term at the end)
lm2 <- lm(diff.exp.protein.expr[i,]~colData$gender+colData$age+colData$treatment+lncRNA.expr[j,]+colData$treatment*lncRNA.expr[j,])
#get the summary of model 2
lm.model <- summary(lm2)
#get the rownames of ith row of matrix 1 and jth row of matrix 2
#get the pvalue from the anova model
#get the estimate and std. error for last two terms of model 2
#add these values as a row to tempdff
tempdff <- rbind(tempdff,data.frame(rownames(diff.exp.protein.expr)[i],rownames(lncRNA.expr)[j],
anova(lm1,lm2)$"Pr(>F)"[2]),lm.model$coefficients[4,1:2],lm.model$coefficients[6,1:2])
}
}
return(tempdff) #return results
}
lm.anova.res <- lm.anova(diff.exp.protein.expr,lncRNA.expr,colData) #call function
And this is how the data, that goes into the function, looks like:
>head(diff.exp.protein.expr)[,1:5] #log transformed values
C00060 C00079 C00135 C00150 C00154
ENSG00000000005.5 5.187977 6.323024 6.022986 5.376513 4.810042
ENSG00000000460.12 7.071433 7.302448 7.499133 7.441582 7.439453
ENSG00000000938.8 8.713285 8.584996 8.982816 9.787420 8.823927
ENSG00000001497.12 9.569952 9.658418 9.392670 9.394250 9.087214
ENSG00000001617.7 9.215362 9.417367 8.949943 8.172713 9.198314
ENSG00000001626.10 6.363638 6.038328 6.698733 5.202132 5.336239
>head(lncRNA.expr)[,1:5] #log transformed values
C00060 C00079 C00135 C00150 C00154
ENSG00000100181.17 7.477983 7.776463 7.950514 8.098205 7.278615
ENSG00000115934.11 4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000122043.6 4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000124915.6 4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000125514.5 4.104380 4.104380 4.104380 4.104380 4.104380
ENSG00000129816.5 4.104380 4.104380 4.104380 4.104380 4.104380
>head(colData)[,1:5] #sample information for each column of the two input matrices
sample_name status age gender site treatment
C00060 NF 48 Female Cleveland case
C00079 NF 26 Female Cleveland case
C00135 NF 55 Female Cleveland case
C00150 NF 61 Male Cleveland ctrl
C00154 NF 50 Male Cleveland case
C00176 NF 60 Female Cleveland ctrl
I wrote this code when I had very few tests (~50) to be performed. Now, I have it all figured out except that I don't know how can I make my code efficient because using a for-loop in this case where 14000*7600 tests have to be performed doesn't make any sense. It would take forever to run. I would like to know what are other alternatives by which I can really speed up this code. Any suggestion would be appreciated.
UPDATE 1: I have updated my linear models a bit. Now, anova(lm1,lm2)"Pr(>F)" does not give the same value as that for the interaction term in model2.
UPDATE 2: I have updated my linear models so that lm1 is nested in lm2.
Thanks,