
I searched a few related topics but could not find this exact question. I have a square correlation matrix whose row and column names are genes. A slice of the matrix is shown below.

                Xelaev15073085m Xelaev15073088m Xelaev15073090m Xelaev15073095m
Xelaev15000002m       0.1250128      -0.6368677       0.3119062       0.3980826
Xelaev15000006m       0.4127414      -0.8805597       0.6435158       0.9629489
Xelaev15000007m       0.4012530      -0.8854113       0.6425895       0.9614517

I also have a data frame of gene pairs whose correlations I want to extract from this large matrix.

      V1              V2
1 Xelaev15011657m Xelaev15017932m
2 Xelaev15011587m Xelaev15046612m
3 Xelaev15011594m Xelaev15046616m
4 Xelaev15011597m Xelaev15046617m
5 Xelaev15011603m Xelaev15046624m
6 Xelaev15011654m Xelaev15017928m

I am trying to loop through the data frame and output the matrix cell for each pair, matrix["gene1","gene2"] (for example, the value 0.1250128 for Xelaev15000002m and Xelaev15073085m). Doing this for a single pair is easy, but my attempts at a for loop over the thousands of pairs in this list are failing. In the example below, headedlist is a sample of the data frame above, and FullcorSM is the full correlation matrix.

for(i in headedlist$V1){
   data.frame(i, headedlist[i,2], FullcorSM[i,headedlist[i,2]])
}

The above loop was my first attempt, and it returns NULL. My second attempt is shown below.

for(i in 1:nrow(stagelist)){
  write.table(data.frame(stagelist$V1, stagelist$V2, FullcorSM["stagelist$V1","stagelist$V2"]),
              file="sampleout",
              sep="\t",quote=F)
}

This returns an out-of-bounds error. Running the second example without the quotes, as FullcorSM[stagelist$V1, stagelist$V2], returns the values for every gene in the second column against each gene in the first column. That is closer to what I want, but I am still missing some knowledge of how R interprets my matrix/data frame syntax, and it is not clear to me what the fix is. Any insight on how to proceed?

shadow
sessmurda
  • I suspect `match` and matrix indexing will allow you to do this in just a couple lines and with no loop. Hard to write code to help, though, without a [reproducible example](http://stackoverflow.com/q/5963269/210673). – Aaron left Stack Overflow Jan 06 '14 at 22:28
  • Nice trick about matrices: all you actually have to do is do `FullcorSM[as.matrix(headedlist)]` and you'll get a vector of the values right out. (*As long as*, that is, all the values in headedlist are actually present as column and/or row names in your matrix). – David Robinson Jan 06 '14 at 22:29
  • Also note that the reason your first attempt only returned `NULL` is because you didn't actually _do_ anything with each value selected within the loop. `for` loops in isolation are just functions (like most other things in R) and `NULL` is the default return value. It's up to you to _assign_ the values to something in order to save them. Otherwise, R will simply look at each value, say "Yep, it's there!" and move on. – joran Jan 06 '14 at 22:31
  • @Joran, thanks for that clarification, will be using David's solution but that will be good for me to know for future loops. – sessmurda Jan 07 '14 at 18:11
  • @sessmurda If David's answer worked for you, you should click on the little check mark next to it, so that future readers will know that his answer solved your problem. – joran Jan 07 '14 at 18:13
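
The two points made in the comments above can be put together in a short sketch. The objects below are small stand-ins built from the values shown in the question (the real `FullcorSM` and `headedlist` are not available, so this slice and these two pairs are assumptions for illustration). The loop stores each looked-up value instead of discarding it, and the last step does the same lookup with matrix indexing.

```r
# Stand-in data (assumption: a small slice in place of the real objects)
FullcorSM <- matrix(
  c(0.1250128, -0.6368677,
    0.4127414, -0.8805597),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Xelaev15000002m", "Xelaev15000006m"),
                  c("Xelaev15073085m", "Xelaev15073088m"))
)
headedlist <- data.frame(
  V1 = c("Xelaev15000002m", "Xelaev15000006m"),
  V2 = c("Xelaev15073088m", "Xelaev15073085m"),
  stringsAsFactors = FALSE
)

# Loop version: iterate over row numbers and assign each value
cors <- numeric(nrow(headedlist))
for (i in seq_len(nrow(headedlist))) {
  cors[i] <- FullcorSM[headedlist$V1[i], headedlist$V2[i]]
}

# Vectorised version: a two-column character matrix picks one cell per row
cors_vec <- FullcorSM[as.matrix(headedlist)]

# Keep the pairs and their correlations together
result <- data.frame(headedlist, cor = cors)
```

For context on the two errors in the question: `FullcorSM["stagelist$V1", "stagelist$V2"]` looks for a row literally named "stagelist$V1", which is why it goes out of bounds, while passing the two vectors unquoted selects the whole block of rows by columns rather than one cell per pair.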

2 Answers


The functionality you're trying to create is actually built into R. You can extract values from a matrix by indexing it with a two-column matrix, where the first column holds the row names and the second holds the column names. Each row of that matrix selects one cell. For example:

m = as.matrix(read.table(text="                Xelaev15073085m Xelaev15073088m Xelaev15073090m Xelaev15073095m
Xelaev15000002m       0.1250128      -0.6368677       0.3119062       0.3980826
Xelaev15000006m       0.4127414      -0.8805597       0.6435158       0.9629489
Xelaev15000007m       0.4012530      -0.8854113       0.6425895       0.9614517"))

# note that your subscript matrix has to be a matrix too, not a data frame
n = as.matrix(read.table(text="Xelaev15000002m Xelaev15073088m
Xelaev15000006m Xelaev15073090m"))

# then it's quite simple
print(m[n])
# [1] -0.6368677  0.6435158
David Robinson
  • Nice. For some reason I thought matrix indexing only worked with numerical indices, not characters, which would have made `match` needed. Guess not! – Aaron left Stack Overflow Jan 06 '14 at 22:37
  • +1! You should maybe note that you get a `subscript out of bounds` error if you try to subset with values that don't exist in the original matrix (as in the OP's example). – agstudy Jan 06 '14 at 22:39
  • +1! A stupid question perhaps... Does it matter which are row names and which are column names in m? I.e. would it work on 'the same' correlation matrix but that happened to be transposed? – Henrik Jan 06 '14 at 22:44
  • @Henrik: It does matter unless the matrix is symmetrical – David Robinson Jan 07 '14 at 15:31
  • Thanks! Nice little trick that I will definitely be utilizing in the future. R is full of so many nifty shortcuts – sessmurda Jan 07 '14 at 18:10
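
The out-of-bounds caveat raised in the comments above could be handled roughly as follows. This is a minimal sketch, assuming a cut-down copy of the example matrix; the third pair is taken from the question's pair list and does not occur in this slice. The names `pairs`, `ok`, and `vals` are made up for the illustration.

```r
# Cut-down copy of the example matrix (assumption: two rows, three columns)
m <- matrix(
  c(0.1250128, -0.6368677, 0.3119062,
    0.4127414, -0.8805597, 0.6435158),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Xelaev15000002m", "Xelaev15000006m"),
                  c("Xelaev15073085m", "Xelaev15073088m", "Xelaev15073090m"))
)

# Pairs to look up; the last one is absent from m
pairs <- cbind(
  V1 = c("Xelaev15000002m", "Xelaev15000006m", "Xelaev15011657m"),
  V2 = c("Xelaev15073088m", "Xelaev15073090m", "Xelaev15017932m")
)

# m[pairs] would stop with "subscript out of bounds", so flag valid pairs first
ok <- pairs[, "V1"] %in% rownames(m) & pairs[, "V2"] %in% colnames(m)

# NA marks pairs that could not be looked up;
# drop = FALSE keeps a two-column matrix even when only one pair is valid
vals <- rep(NA_real_, nrow(pairs))
vals[ok] <- m[pairs[ok, , drop = FALSE]]
vals
```

Keeping `vals` the same length as `pairs` makes it easy to bind the result back onto the original pair list and see which pairs were missing.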

Not nearly as clean as @David Robinson's very nice solution. Here, though, it doesn't matter which genes are in the rows and which are in the columns of the correlation matrix, and the subscript matrix may contain combinations that are not in the correlation matrix. Same matrix names as in @David's solution:

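# combinations of row and column names, as stored and as transposed
m_fwd <- c(outer(rownames(m), colnames(m), paste))
m_rev <- c(outer(colnames(m), rownames(m), paste))

# 'dim names' in subscript matrix
n_comb <- paste(n[, "V1"], n[, "V2"])

# swap the columns of pairs that only match the transposed orientation
swap <- n_comb %in% m_rev & !(n_comb %in% m_fwd)
n2 <- n
n2[swap, ] <- n[swap, 2:1]

# subset; drop = FALSE keeps a two-column matrix even if one pair is left
m[n2[n_comb %in% c(m_fwd, m_rev), , drop = FALSE]]
# [1] -0.6368677  0.6435158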

Update

Another possibility, slightly more convoluted but perhaps with a more useful output. First read the correlation matrix into a data frame df, and the subscript matrix into a data frame df2.

# add row names as a column in correlation matrix
df$rows <- rownames(df)

# melt the correlation matrix
library(reshape2)
df3 <- melt(df)

# merge subscript data and correlation data
df4 <- merge(x = df2, y = df3, by.x = c("V1", "V2"), by.y = c("rows", "variable"))
df4
#                V1              V2      value
# 1 Xelaev15000002m Xelaev15073088m -0.6368677
# 2 Xelaev15000006m Xelaev15073090m  0.6435158
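
If adding a package dependency is not wanted, the same reshape-then-merge idea can probably be done in base R. The sketch below is an assumption-laden stand-in rather than part of the original answer: it rebuilds a small `m` and `df2` from the example values and uses `as.table()` plus `as.data.frame()` to get one row per matrix cell.

```r
# Stand-in inputs (assumption: a small slice of the example data)
m <- matrix(
  c(0.1250128, -0.6368677, 0.3119062,
    0.4127414, -0.8805597, 0.6435158),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("Xelaev15000002m", "Xelaev15000006m"),
                  c("Xelaev15073085m", "Xelaev15073088m", "Xelaev15073090m"))
)
df2 <- data.frame(
  V1 = c("Xelaev15000002m", "Xelaev15000006m"),
  V2 = c("Xelaev15073088m", "Xelaev15073090m"),
  stringsAsFactors = FALSE
)

# One row per (row name, column name) cell; columns are Var1, Var2, value
long <- as.data.frame(as.table(m), responseName = "value",
                      stringsAsFactors = FALSE)

# Keep only the requested pairs, with their correlation attached
df4 <- merge(df2, long, by.x = c("V1", "V2"), by.y = c("Var1", "Var2"))
df4
```

As with the `melt` version, pairs that are not present in the matrix simply drop out of the merged result; passing `all.x = TRUE` to `merge` would keep them with `NA` values instead.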
Henrik