I have a 'homework' task: generate a matrix that records some matched information. I know I can do this with a for-loop, but my real matrix is rather big, so the loop would mean endless waiting. Is there a faster way to get there?

Basically, I want to match gene expression to pathways. Specifically, if a gene (e.g., gene1) belongs to a pathway (e.g., pathway1), the gene1-pathway1 cell should hold that gene's expression value; if the gene is not in the pathway, the cell should be 0. It sounds simple, but I am stuck. The following example shows what I want to get.

path <- read.table(header = TRUE, text = "pathway gene
pathway1 gene1
pathway1 gene2
pathway1 gene3
pathway1 gene4
pathway2 gene1
pathway2 gene5
pathway3 gene3
pathway3 gene6
pathway3 gene7
")

expr <- read.table(header = TRUE, text = "gene expression
gene1 1
gene2 2
gene3 3
gene4 4
gene5 5
gene6 6
gene8 8
")

out <- matrix(0,
              nrow = length(unique(path$pathway)),
              ncol = length(unique(expr$gene)),
              dimnames = list(unique(path$pathway),unique(expr$gene)))

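for (p in rownames(out)) {
  # genes that belong to pathway p (looked up once per pathway)
  genes_in_p <- path$gene[path$pathway == p]
  for (g in colnames(out)) {
    # copy the expression value only when gene g is in pathway p
    if (g %in% genes_in_p) {
      out[p, g] <- expr$expression[expr$gene == g]
    }
  }
}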
print(out)
#          gene1 gene2 gene3 gene4 gene5 gene6 gene8
# pathway1     1     2     3     4     0     0     0
# pathway2     1     0     0     0     5     0     0
# pathway3     0     0     3     0     0     6     0

The expected output is printed above, but I wonder whether there is a much faster way to get there, because I have a really big matrix to deal with.

I hope someone can help. Many thanks in advance!

Sugus

2 Answers

It looks like the final output should contain only the genes that are present in expr (gene7 is absent from your expected output). In base R, we can first filter path down to those genes, then make gene a factor with levels taken from expr$gene, and then use table to get the membership matrix:

path <- path[path$gene %in% expr$gene, ]
path$gene <- factor(path$gene, levels = expr$gene)
table(path)

#          gene
#pathway    gene1 gene2 gene3 gene4 gene5 gene6 gene8
#  pathway1     1     1     1     1     0     0     0
#  pathway2     1     0     0     0     1     0     0
#  pathway3     0     0     1     0     0     1     0

However, if you want to replace these 1's with the expression values instead, we can do

df1 <- as.data.frame.matrix(table(path))
mapply(function(x, y) replace(x, x != 0, y), df1, expr$expression)

#     gene1 gene2 gene3 gene4 gene5 gene6 gene8
#[1,]     1     2     3     4     0     0     0
#[2,]     1     0     0     0     5     0     0
#[3,]     0     0     3     0     0     6     0
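
If the 0/1 table is not needed for its own sake, the same values can probably be written straight into a zero matrix with matrix indexing, which avoids both the loop and the intermediate table. This is only a minimal sketch: it assumes each gene appears once in expr, it rebuilds the question's toy data so the block runs on its own, and the helper names i, j, and keep are just illustrative.

```r
# Toy data from the question, rebuilt here so the sketch is self-contained
path <- data.frame(
  pathway = rep(c("pathway1", "pathway2", "pathway3"), times = c(4, 2, 3)),
  gene    = c("gene1", "gene2", "gene3", "gene4",
              "gene1", "gene5",
              "gene3", "gene6", "gene7"),
  stringsAsFactors = FALSE
)
expr <- data.frame(
  gene       = c("gene1", "gene2", "gene3", "gene4", "gene5", "gene6", "gene8"),
  expression = c(1, 2, 3, 4, 5, 6, 8),
  stringsAsFactors = FALSE
)

pathways <- unique(path$pathway)

# Start from an all-zero pathway x gene matrix
# (assumption: expr$gene has no duplicates, so it can serve as the column set)
out <- matrix(0,
              nrow = length(pathways),
              ncol = nrow(expr),
              dimnames = list(pathways, expr$gene))

# Row and column position of every (pathway, gene) pair
i <- match(path$pathway, pathways)
j <- match(path$gene, expr$gene)   # NA for genes missing from expr (gene7)
keep <- !is.na(j)

# A two-column index matrix fills all matching cells in one vectorised step
out[cbind(i[keep], j[keep])] <- expr$expression[j[keep]]

out
```

Unlike the mapply result, this keeps the pathway names as row names. The work is proportional to the number of rows in path rather than to pathways times genes, which is where the speed-up over the double loop would come from.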
Ronak Shah

This gives the desired output. There is probably a simpler way, but it works; tell me whether it is quicker:

library(reshape)
df <- merge(path, expr, by = "gene", all = TRUE)
df <- t(cast(gene ~ pathway, data = df))

df <- df[-which(rownames(df) == "NA"),]
df[is.na(df)] <- 0
df
Rémi Coulaud
  • Thank you so much. However, I have to tell you that with a really large matrix, merge produces a very long data frame, and running cast on top of that is quite a catastrophe; my computer shut down twice just now, lol. – Sugus Jun 09 '19 at 09:25
  • Thanks, I didn't know how fast the cast function was. Slow, apparently. – Rémi Coulaud Jun 09 '19 at 11:27
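
On the memory problem raised in the comments: when most pathway-gene cells are 0, a sparse matrix might be a better fit than a dense one, because only the non-zero cells are stored. The sketch below uses sparseMatrix() from the Matrix package (shipped with standard R installations). It rebuilds the question's toy data so it runs on its own, and it assumes each gene appears once in expr and each pathway-gene pair appears once in path (sparseMatrix() adds up the values of repeated pairs by default).

```r
library(Matrix)

# Toy data from the question, rebuilt here so the sketch is self-contained
path <- data.frame(
  pathway = rep(c("pathway1", "pathway2", "pathway3"), times = c(4, 2, 3)),
  gene    = c("gene1", "gene2", "gene3", "gene4",
              "gene1", "gene5",
              "gene3", "gene6", "gene7"),
  stringsAsFactors = FALSE
)
expr <- data.frame(
  gene       = c("gene1", "gene2", "gene3", "gene4", "gene5", "gene6", "gene8"),
  expression = c(1, 2, 3, 4, 5, 6, 8),
  stringsAsFactors = FALSE
)

pathways <- unique(path$pathway)

# Drop pairs whose gene has no expression value (gene7 in the toy data)
keep <- path$gene %in% expr$gene
j <- match(path$gene[keep], expr$gene)

# Only the non-zero cells are stored: row index, column index, value
sp <- sparseMatrix(
  i = match(path$pathway[keep], pathways),
  j = j,
  x = expr$expression[j],
  dims = c(length(pathways), nrow(expr)),
  dimnames = list(pathways, expr$gene)
)

sp
# as.matrix(sp) converts back to a dense matrix if one is really needed
```

Whether this helps depends on how sparse the real data is; if most genes belong to most pathways, a dense matrix filled by matrix indexing is likely just as good.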