
How to merge the information in two matrices when one is a submatrix of the other but holds different information?

I have two matrices, one 1223x1223 and one 7096x7096. Both matrices hold DrugBank drug-drug scores ranging from 0 to 1: the bigger matrix holds the chemical structural similarity of the drugs, and the smaller one holds a different similarity score.

I want to know how I can merge these two into a single matrix (data fusion) that keeps the information from both. So, if Drug 1 and Drug 2 have scores 0.5 and 0.7 in the two matrices, what is the best way to fuse the data so that I do not lose information?

This is an example of my data:

Data1
          DB00006   DB00014   DB00035   DB00050   DB00091   DB00093   DB00104   DB00115
DB00006 1.0000000 0.8139535 0.8205128 0.7976190 0.6075949 0.6835443 0.6547619 0.6666667
DB00014 0.8139535 1.0000000 0.7500000 0.8111111 0.5617978 0.6292135 0.6966292 0.7200000
DB00035 0.8205128 0.7500000 1.0000000 0.7325581 0.5243902 0.8450704 0.7564103 0.6122449
DB00050 0.7976190 0.8111111 0.7325581 1.0000000 0.5764706 0.6091954 0.6976744 0.6700000
DB00091 0.6075949 0.5617978 0.5243902 0.5764706 1.0000000 0.4871795 0.5250000 0.5543478
DB00093 0.6835443 0.6292135 0.8450704 0.6091954 0.4871795 1.0000000 0.8028169 0.5360825
DB00104 0.6547619 0.6966292 0.7564103 0.6976744 0.5250000 0.8028169 1.0000000 0.5816327

Data2
            DB07768    DB07886    DB07702    DB07465    DB08567    DB07129    DB08298
DB00014 0.260115607 0.19402985 0.22112211 0.11636364 0.26256983 0.18936877 0.29700855
DB00035 0.176344086 0.19935691 0.19545455 0.15606936 0.21489362 0.19523810 0.23456790
DB00050 0.037470726 0.05490196 0.05298013 0.09090909 0.03318584 0.05755396 0.03664921
DB00091 0.211974110 0.21052632 0.14814815 0.11666667 0.28192372 0.15856777 0.32452830
DB00104 0.200686106 0.20642202 0.15877437 0.12420382 0.26795096 0.19174041 0.31653226
DB00122 0.002469136 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000 0.00000000

Data

Data1 <- 
structure(c(1, 0.813953488, 0.820512821, 0.797619048, 0.607594937, 
0.683544304, 0.654761905, 0.813953488, 1, 0.75, 0.811111111, 
0.561797753, 0.629213483, 0.696629213, 0.820512821, 0.75, 1, 
0.73255814, 0.524390244, 0.845070423, 0.756410256, 0.797619048, 
0.811111111, 0.73255814, 1, 0.576470588, 0.609195402, 0.697674419, 
0.607594937, 0.561797753, 0.524390244, 0.576470588, 1, 0.487179487, 
0.525, 0.683544304, 0.629213483, 0.845070423, 0.609195402, 0.487179487, 
1, 0.802816901, 0.654761905, 0.696629213, 0.756410256, 0.697674419, 
0.525, 0.802816901, 1, 0.666666667, 0.72, 0.612244898, 0.67, 
0.554347826, 0.536082474, 0.581632653), .Dim = 7:8, .Dimnames = list(
    c("DB00006", "DB00014", "DB00035", "DB00050", "DB00091", 
    "DB00093", "DB00104"), c("DB00006", "DB00014", "DB00035", 
    "DB00050", "DB00091", "DB00093", "DB00104", "DB00115")))

Data2 <-
structure(c(0.260115607, 0.176344086, 0.037470726, 0.21197411, 
0.200686106, 0.002469136, 0.194029851, 0.199356913, 0.054901961, 
0.210526316, 0.206422018, 0, 0.221122112, 0.195454545, 0.052980132, 
0.148148148, 0.158774373, 0, 0.116363636, 0.156069364, 0.090909091, 
0.116666667, 0.124203822, 0, 0.262569832, 0.214893617, 0.033185841, 
0.281923715, 0.267950963, 0, 0.189368771, 0.195238095, 0.057553957, 
0.158567775, 0.191740413, 0, 0.297008547, 0.234567901, 0.036649215, 
0.324528302, 0.316532258, 0), .Dim = 6:7, .Dimnames = list(c("DB00014", 
"DB00035", "DB00050", "DB00091", "DB00104", "DB00122"), c("DB07768", 
"DB07886", "DB07702", "DB07465", "DB08567", "DB07129", "DB08298"
)))
  • Perhaps easiest to reshape both matrices to long format (see `reshape2::melt`), so you have three columns - drug1, drug2, and value. Then merge these (see `merge`). – user20650 May 24 '16 at 11:13
  • so the merge will simply add the values from drug1-drug2 relations? – Anurag May 24 '16 at 11:18
  • There is no best way to merge them: it depends on what you want to do next with the data arrays. How come your matrices have different dimensions? I assume every row and column represents a single drug? Why then is one of the matrices larger than the other? Is this because the smaller dataset contains information that is unknown for some drugs? – Erik May 24 '16 at 11:30
  • Actually the smaller matrix holds different data, also ranging from 0 to 1, and only a subset of the drugs in the bigger matrix have that kind of data. So, technically it's a sort of data fusion. – Anurag May 24 '16 at 11:36
  • Show a snippet of your data sets, then it would be easier to help. You can use `dput(head(your_data))` – Roman May 24 '16 at 11:37
  • 1
    @Anurag ; if you melt both matrices, you will have two three column dataframes. You can then merge these by the first two columns columns (which will have the drug names from the matrix columns and rows). For more specific advice, you should edit your question with a [small, reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – user20650 May 24 '16 at 11:42
  • 1
    I just shared a small set of data. Kindly check. – Anurag May 24 '16 at 12:19
  • @Anurag You deleted your question, and your explanation. Please include it again. – Roman May 24 '16 at 12:28
  • 1
    @user20650 - I want in R. Thank you – Anurag May 24 '16 at 12:34
  • @Anurag ; assuming you want every value from your matrices used, you can do `library(reshape2) ; m1 <- melt(Data1); m2 <- melt(Data2) ; merge(m1, m2, by=c("Var1", "Var2"), all=TRUE)`. However, your `Data1` matrix looks as if it is symmetric, so you will have a lot of replication (with the drug names reversed). If you do not want this replication, you can remove the values in the lower half of the matrix (and also skip the diagonal): i.e. use `Data1[lower.tri(Data1, diag=TRUE)] <- NA ; m1 <- melt(Data1, na.rm=TRUE)` then continue as before. – user20650 May 26 '16 at 10:49
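For reference, a minimal sketch of the reshape2 route suggested in the comments above, assuming both matrices carry the drug IDs as dimnames (as in the `dput` data shown earlier); the column names drug1/drug2/score1/score2 and the object `both` are just illustrative:

library(reshape2)

# melt each matrix to long format: one row per (drug, drug) pair
m1 <- melt(Data1, varnames = c("drug1", "drug2"), value.name = "score1")
m2 <- melt(Data2, varnames = c("drug1", "drug2"), value.name = "score2")

# full outer join on the drug pair, keeping pairs present in either matrix
both <- merge(m1, m2, by = c("drug1", "drug2"), all = TRUE)
head(both)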

1 Answer


Using base R (essentially a reshape trick):

#convert to sparse format
DF_1 <- data.frame(cbind(which(!is.na(Data1), arr.ind = T), val_1 = Data1[!is.na(Data1)]))
DF_2 <- data.frame(cbind(which(!is.na(Data2), arr.ind = T), val_2 = Data2[!is.na(Data2)]))

#add row and column names
DF_1[, "row"] <- row.names(Data1)[DF_1[, "row"]]
DF_1[, "col"] <- colnames(Data1)[DF_1[, "col"]]

DF_2[, "row"] <- row.names(Data2)[DF_2[, "row"]]
DF_2[, "col"] <- colnames(Data2)[DF_2[, "col"]]

#merge without losing information from either matrix
merge(DF_1, DF_2, by = c("row","col"), all.x = T, all.y = T)

          row     col     val_1       val_2
1  DB00006 DB00006 1.0000000          NA
2  DB00006 DB00014 0.8139535          NA
3  DB00006 DB00035 0.8205128          NA
4  DB00006 DB00050 0.7976190          NA
5  DB00006 DB00091 0.6075949          NA
...
93 DB00122 DB07465        NA 0.000000000
94 DB00122 DB07702        NA 0.000000000
95 DB00122 DB07768        NA 0.002469136
96 DB00122 DB07886        NA 0.000000000
97 DB00122 DB08298        NA 0.000000000
98 DB00122 DB08567        NA 0.000000000

Note that if both your matrices are symmetric, you can be much more clever with:

DF_1 <- data.frame(cbind(which(lower.tri(Data1), arr.ind = T), val_1 = Data1[lower.tri(Data1)]))
DF_2 <- data.frame(cbind(which(lower.tri(Data2), arr.ind = T), val_2 = Data2[lower.tri(Data2)]))

as the first step, then add the row and column names and merge as before (a full sketch follows below).
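Putting the symmetric variant together, a sketch under the assumption that the full 1223x1223 and 7096x7096 matrices are square and symmetric with drug IDs as dimnames (the small example above is not square, so this is meant for the real data); object names match the answer:

# keep one triangle of each symmetric matrix (strictly below the diagonal)
DF_1 <- data.frame(cbind(which(lower.tri(Data1), arr.ind = T), val_1 = Data1[lower.tri(Data1)]))
DF_2 <- data.frame(cbind(which(lower.tri(Data2), arr.ind = T), val_2 = Data2[lower.tri(Data2)]))

# map the numeric indices back to drug IDs, exactly as above
DF_1[, "row"] <- row.names(Data1)[DF_1[, "row"]]
DF_1[, "col"] <- colnames(Data1)[DF_1[, "col"]]
DF_2[, "row"] <- row.names(Data2)[DF_2[, "row"]]
DF_2[, "col"] <- colnames(Data2)[DF_2[, "col"]]

# full outer join on the drug pair
merge(DF_1, DF_2, by = c("row", "col"), all.x = T, all.y = T)

Note this assumes both matrices list the drugs in a consistent order, so that a given pair falls on the same side of the diagonal in both; otherwise the two halves will not line up in the merge.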

Chris
  • Thanks Chris, I think I do need the lower triangle as my data is symmetric; when I did not use it, I got duplicates and my laptop crashed. I will try this solution and get back. – Anurag May 25 '16 at 01:11
  • Hey, after the merge, I need to add the values of duplicate rows and remove the individual rows. Can I use aggregate or ddply for that? – Anurag May 25 '16 at 01:50
  • I tried the solution but ran into memory error. Is there a memory efficient way of doing this? – Anurag May 25 '16 at 09:23
  • @Anurag what step is causing the error, and how much RAM do you have? – Chris May 25 '16 at 23:23
  • The merge step is causing an error. I have 8 GB of RAM – Anurag May 26 '16 at 11:16
  • @Anurag either try data.table instead of data.frame, or use SQL instead. The largest possible data.frame that could come from this is 26 million rows (if there are no matches between the two), which potentially may not fit in memory. You could also try converting the ids to factors, e.g. `DF_1[, "row"] <- as.factor(row.names(Data1)[DF_1[, "row"]])` (a data.table sketch follows after these comments). – Chris May 26 '16 at 15:25
  • Thanks, I will try that. – Anurag May 27 '16 at 16:09
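Following up on the data.table suggestion in the last comment, a hedged sketch (assuming the data.table package is installed and `DF_1`/`DF_2` have been built as in the answer; `DT_1`, `DT_2` and `res` are illustrative names). A keyed data.table join is usually lighter on memory than base `merge` on a data.frame:

library(data.table)

# convert the long-format tables built in the answer
DT_1 <- as.data.table(DF_1)   # columns: row, col, val_1
DT_2 <- as.data.table(DF_2)   # columns: row, col, val_2

# key both tables on the drug pair
setkey(DT_1, row, col)
setkey(DT_2, row, col)

# full outer join, keeping every pair from either table
res <- merge(DT_1, DT_2, by = c("row", "col"), all = TRUE)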