I have a very large reference file with millions of pairwise comparisons between thousands of objects ("OTUs"). The data frame is in long format:

'data.frame':    14845516 obs. of  3 variables:
 $ OTU1   : chr  "0" "0" "0" "0" ...
 $ OTU2   : chr  "8192" "1" "8194" "3" ...
 $ gendist: num  78.7 77.8 77.6 74.4 75.3 ...

I also have a much smaller subset with observed data (slightly different structure):

'data.frame':   286903 obs. of  3 variables:
 $ OTU1   : chr  "1239" "1603" "2584" "1120" ...
 $ OTU2   : chr  "12136" "12136" "12136" "12136" ...
 $ ecodist: num  2.08 1.85 2 1.73 1.53 ...
 - attr(*, "na.action")=Class 'omit'  Named int [1:287661] 1 759 760 1517 1518 1519 2275 2276 2277 2278 ...
  .. ..- attr(*, "names")= chr [1:287661] "1" "759" "760" "1517" ...

Again, it's a pairwise comparison of objects ("OTUs"). All objects in the smaller dataset are also in the reference dataset.

I want to reduce the reference so that it only contains objects that are also found in the smaller dataset. It is very important that this is done on both columns (OTU1, OTU2).

Here is toy data:

library(reshape)
###reference
Ref <- cor(as.data.frame(matrix(rnorm(100),10,10)))
row.names(Ref) <- colnames(Ref) <- LETTERS[1:10]
Ref[upper.tri(Ref)] <- NA
diag(Ref) <- NA
Ref.m <- na.omit(melt(Ref, varnames = c('row', 'col')))
###query
tmp <- cor(as.data.frame(matrix(rnorm(25),5,5)))
row.names(tmp) <- colnames(tmp) <- LETTERS[1:5]
tmp[upper.tri(tmp)] <- NA
diag(tmp) <- NA
tmp.m <- na.omit(melt(tmp, varnames = c('row', 'col')))
nouse

1 Answer

The following works for me using your toy data:

Ref[rownames(tmp), colnames(tmp)]

This selects (by name) only those rows in Ref whose names are also the names of rows in tmp, and likewise for columns.
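
As a minimal sketch of how name-based indexing behaves, here is a small made-up example. The matrix `m` and the vector `keep` are hypothetical and not part of your data; note that indexing with a name that is missing from the matrix raises a "subscript out of bounds" error.

```r
# Hypothetical 4 x 4 matrix with named rows and columns (filled column-wise)
m <- matrix(1:16, nrow = 4, dimnames = list(LETTERS[1:4], LETTERS[1:4]))

# Names to retain (illustrative only)
keep <- c("A", "C")

# Character indices select rows and columns by name
sub <- m[keep, keep]
print(sub)
```

With your toy data, `rownames(tmp)` and `colnames(tmp)` play the role of `keep`.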

If you want to stick with the long format shown in the str output in the first part of your question, you can instead use something like the following, where data1 is the reference data frame and data2 is the smaller observed one:

data1[(data1$OTU1 %in% data2$OTU1) & (data1$OTU2 %in% data2$OTU2), ]

Here I'm creating a logical vector that indicates which rows of your reference data frame (data1) have their OTU1 entry somewhere in data2$OTU1, and the same for OTU2. Said logical vector is then used to select rows of data1.
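
One caveat: the filter above compares each column only with the same column of data2. If an OTU can show up in either column of the observed data, pooling both columns first may be closer to what you describe. The following is a minimal sketch under that assumption; the toy `data1` and `data2` values and the names `keep` and `reduced` are invented for illustration.

```r
# Toy stand-ins for the reference (data1) and observed (data2) data frames
data1 <- data.frame(
  OTU1 = c("0", "0", "1", "2", "3"),
  OTU2 = c("1", "2", "2", "3", "1"),
  gendist = c(78.7, 77.8, 77.6, 74.4, 75.3),
  stringsAsFactors = FALSE
)
data2 <- data.frame(
  OTU1 = c("1", "2"),
  OTU2 = c("3", "3"),
  ecodist = c(2.08, 1.85),
  stringsAsFactors = FALSE
)

# Pool every OTU that occurs in either column of the observed data
keep <- union(data2$OTU1, data2$OTU2)

# Keep only reference rows where both OTUs are in the pooled set
reduced <- data1[data1$OTU1 %in% keep & data1$OTU2 %in% keep, ]
print(reduced)
```

Here every pair involving OTU "0" is dropped, while the pair ("3", "1") is kept even though "3" never appears in data2$OTU1. Since %in% is vectorized, the same expression should also be workable on the full reference data frame.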

Empiromancer