match with an interval and extract values between two matrix R

Question

I have n matrix in a list and an additional matrix which contain the value I want to find in the list of matrix.

To get the list of matrix, I use this code :

setwd("C:\\~\\Documents\\R") 


import.multiple.txt.files<-function(pattern=".txt",header=T)
{
list.1<-list.files(pattern=".txt")
list.2<-list()
for (i in 1:length(list.1))
{
list.2[[i]]<-read.delim(list.1[i])
}
names(list.2)<-list.1
list.2

}


txt.import.matrix<-cbind(txt.import)

My list look like that: (I show only an example with n=2). The number of rows in each array is different (here I just take 5 and 6 rows to simplify but I have in my true data more than 500).

txt.import.matrix[1]

    [[1]]
     X.     RT.     Area.  m.z.      
1     1     1.01   2820.1  358.9777  
2     2     1.03   9571.8  368.4238  
3     3     2.03   6674.0  284.3294  
4     4     2.03   5856.3  922.0094  
5     5     3.03   27814.6 261.1299  


txt.import.matrix[2]

    [[2]]
     X.     RT.     Area.  m.z.      
1     1     1.01    7820.1 358.9777  
2     2     1.06    8271.8 368.4238  
3     3     2.03   12674.0 284.3294  
4     4     2.03    5856.6 922.0096  
5     5     2.03   17814.6 261.1299
6     6     3.65    5546.5 528.6475

I have another array of values I want to find in the list of matrix. This array was obtained by combine all the array from the list in an array and removing the duplicates.

reduced.list.pre.filtering

     RT.   m.z.
1    1.01  358.9777
2    1.07  368.4238
3    2.05  284.3295
4    2.03  922.0092
5    3.03  261.1299
6    3.56  869.4558

I would like to obtain a new matrix where it is written the Area. result of matched RT. ± 0.02 and m.z. ± 0.0002 for all the matrix in the list. The output could be like that.

     RT.   m.z.        Area.[1]      Area.[2]
1    1.01  358.9777    2820.1        7820.1
2    1.07  368.4238                  8271.8      
3    2.05  284.3295    6674.0        12674.0
4    2.03  922.0092    5856.3             
5    3.03  261.1299    27814.6            
6    3.65  528.6475

I have only an idea how to match only one exact value in one array. The difficulty here is to find the value in a list of array and need to find the value ± an interval. If you have any suggestion, I will be very grateful.

score 3 · Answer 1 · answered Jul 18 '16 at 11:05

Using non-equi joins from current development version of data.table, v1.9.7 (See installation instructions), which allows non-equi conditions to be provided to joins:

require(data.table) # v1.9.7
names(ll) = c("Area1", "Area2")
A = rbindlist(lapply(ll, as.data.table), idcol = "id")           ## (1)

B = as.data.table(mat)
B[, c("RT.minus", "RT.plus") := .(RT.-0.02, RT.+0.02)]
B[, c("m.z.minus", "m.z.plus") := .(m.z.-0.0002, m.z.+0.0002)]   ## (2)

ans = A[B, .(id, X., RT. = i.RT., m.z. = i.m.z., Area.), 
           on = .(RT. >= RT.minus, RT. <= RT.plus, 
                  m.z. >= m.z.minus, m.z. <= m.z.plus)]          ## (3)

dcast(ans, RT. + m.z. ~ id)                                      ## (4)
# or dcast(ans, RT. + m.z. ~ id, fill = 0)
#     RT.     m.z.   Area1   Area2
# 1: 1.01 358.9777  2820.1  7820.1
# 2: 1.07 368.4238      NA  8271.8
# 3: 2.03 922.0092  5856.3      NA
# 4: 2.05 284.3295  6674.0 12674.0
# 5: 3.03 261.1299 27814.6      NA

[1] Name the list of matrices (called ll here) and convert each of them to a data.table using lapply(), and bind them row-wise using rbindlist, and add the names as an extra column (idcol). Call it A.

[2] Convert the second matrix (called mat here) to data.table as well. Add additional columns corresponding to the ranges/intervals you want to search for (since the on= argument, as we'll see in the next step, can't handle expressions yet). Call it B.

[3] Perform the conditional join/subset. For each row in B, find the matching rows in A corresponding to the condition provided to on= argument, and extract the columns id, X., R.T. and m.z. for those matching row indices.

[4] It's better to leave it at [3]. But if you'd like it as shown in your answer, we can reshape it into wide format. fill = 0 would replace NAs in the result with 0.

score 2 · Accepted Answer · edited May 23 '17 at 11:46

This is an alternative approach to Arun's rather elegant answer using data.table. I decided to post it because it contains two additional aspects that are important considerations in your problem:

Floating point comparison: comparison to see if a floating point value is in an interval requires consideration of the round-off error in computing the interval. This is the general problem of comparing floating point representations of real numbers. See this and this in the context of R. The following implements this comparison in the function in.interval.
Multiple matches: your interval match criterion can result in multiple matches if the intervals overlap. The following assumes that you only want the first match (with respect to increasing rows of each txt.import.matrix matrix). This is implemented in the function match.interval and explained in the notes to follow. Other logic is needed if you want to get something like the average of the areas that match your criterion.

To find the matching row(s) in a matrix from txt.import.matrix for each row in the matrix reduced.list.pre.filtering, the following code vectorizes the application of the comparison function over the space of all enumerated pairs of rows between reduced.list.pre.filtering and the matrix from txt.import.matrix. Functionally for this application, this is the same as Arun's solution using data.table's non-equi joins; however, the non-equi join feature is more general and the data.table implementation is most likely better optimized for both memory usage and speed for even this application.

in.interval <- function(x, center, deviation, tol = .Machine$double.eps^0.5) {
  return (abs(x-center) <= (deviation + tol))
}

match.interval <- function(r, t) {
  r.rt <- rep(r[,1], each=nrow(t))
  t.rt <- rep(t[,2], times=nrow(r))
  r.mz <- rep(r[,2], each=nrow(t))
  t.mz <- rep(t[,4], times=nrow(r))                                       ## 1.

  ind <- which(in.interval(r.rt, t.rt, 0.02) & 
               in.interval(r.mz, t.mz, 0.0002))
  r.ind <- floor((ind - 1)/nrow(t)) + 1                                   ## 2.

  dup <- duplicated(r.ind)
  r.ind <- r.ind[!dup]
  t.ind <- ind[!dup] - (r.ind - 1)*nrow(t)                                ## 3.
  return(cbind(r.ind,t.ind))                       
}

get.area.matched <- function(r, t) {
  match.ind <- match.interval(r, t)
  area <- rep(NA,nrow(r))
  area[match.ind[,1]] <- t[match.ind[,2], 3]                              ## 4.
  return(area)
}

res <- cbind(reduced.list.pre.filtering,
             do.call(cbind,lapply(txt.import.matrix, 
                                  get.area.matched, 
                                  r=reduced.list.pre.filtering)))         ## 5.
colnames(res) <- c(colnames(reduced.list.pre.filtering), 
                   sapply(seq_len(length(txt.import.matrix)), 
                          function(i) {return(paste0("Area.[",i,"]"))}))  ## 6.
print(res)
##      RT.     m.z. Area.[1] Area.[2]
##[1,] 1.01 358.9777   2820.1   7820.1
##[2,] 1.07 368.4238       NA   8271.8
##[3,] 2.05 284.3295   6674.0  12674.0
##[4,] 2.03 922.0092   5856.3       NA
##[5,] 3.03 261.1299  27814.6       NA
##[6,] 3.56 869.4558       NA       NA

Notes:

This part constructs the data to enable the vectorization of the application of the comparison function over the space of all enumerated pairs of rows between reduced.list.pre.filtering and the matrix from txt.import.matrix. The data to be constructed are four arrays that are the replications (or expansions) of the two columns, used in the comparison criterion, of reduced.list.pre.filtering in the row dimension of each matrix from txt.import.matrix and the replications of the two columns, used in the comparison criterion, of each matrix from txt.import.matrix in the row dimension of reduced.list.pre.filtering. Here, the term array refers to either a 2-D matrix or a 1-D vector. The resulting four arrays are:
- r.rt is the replication of the RT. column of reduced.list.pre.filtering (i.e., r[,1]) in the row dimension of t
- t.rt is the replication of the RT. column of the matrix from txt.import.matrix (i.e., t[,2]) in the row dimension of r
- r.mz is the replication of the m.z. column of reduced.list.pre.filtering (i.e. r[,2]) in the row dimension of t
- t.mz is the replication of the m.z. column of the matrix from txt.import.matrix (i.e. t[,4]) in the row dimension of r
What is important is that the indices for each of these arrays enumerate all pairs of rows in r and t in the same manner. Specifically, viewing these arrays as 2-D matrices of size M by N where M=nrow(t) and N=nrow(r), the row indices correspond to the rows of t and the column indices correspond to the rows of r. Consequently, the array values (over all four arrays) at the i-th row and the j-th column (of each of the four arrays) are the values used in the comparison criterion between the j-th row of r and the i-th row of t. Implementation of this replication process uses the R function rep. For example, in computing r.rt, rep with each=M is used, which has the effect of treating its array input r[,1] as a row vector and replicating that row M times to form M rows. The result is such that each column, which corresponds to a row in r, has the RT. value from the corresponding row of r and that value is the same for all rows (of that column) of r.rt, each of which corresponds to a row in t. This means that in comparing that row in r to any row in t, the value of RT. from that row in r is used. Conversely, in computing t.rt, rep with times=N is used, which has the effect of treating its array input as a column vector and replicating that column N times to form a N columns. The result is such that each row in t.rt, which corresponds to a row in t, has the RT. value from the corresponding row of t and that value is the same for all columns (of that row) of t.rt, each of which corresponds to a row in r. This means that in comparing that row in t to any row in r, the value of RT. from that row in t is used. Similarly, the computations of r.mz and t.mz follow using the m.z. column from r and t, respectively.
This performs the vectorized comparison resulting in a M by N logical matrix where the i-th row and the j-th column is TRUE if the j-th row of r matches the criterion with the i-th row of t, and FALSE otherwise. The output of which() is the array of array indices to this logical comparison result matrix where its element is TRUE. We want to convert these array indices to the row and column indices of the comparison result matrix to refer back to the rows of r and t. The next line extracts the column indices from the array indices. Note that the variable name is r.ind to denote that these correspond to the rows of r. We extract this first because it is important for detecting multiple matches for a row in r.
This part handles possible multiple matches in t for a given row in r. Multiple matches will show up as duplicate values in r.ind. As stated above, the logic here only keeps the first match in terms of increasing rows in t. The function duplicated returns all the indices of duplicate values in the array. Therefore removing these elements will do what we want. The code first removes them from r.ind, then it removes them from ind, and finally computes the column indices to the comparison result matrix, which corresponds to the rows of t, using the pruned ind and r.ind. What is returned by match.interval is a matrix whose rows are matched pair of row indices with its first column being row indices to r and its second column being row indices to t.
The get.area.matched function simply uses the result from match.ind to extract the Area from t for all matches. Note that the returned result is a (column) vector with length equaling to the number of rows in r and initialized to NA. In this way rows in r that has no match in t has a returned Area of NA.
This uses lapply to apply the function get.area.matched over the list txt.import.matrix and append the returned matched Area results to reduced.list.pre.filtering as column vectors. Similarly, the appropriate column names are also appended and set in the result res.

Edit: Alternative implementation using the foreach package

In hindsight, a better implementation uses the foreach package for vectorizing the comparison. In this implementation, the foreach and magrittr packages are required

require("magrittr")  ## for %>%
require("foreach")

Then the code in match.interval for vectorizing the comparison

r.rt <- rep(r[,1], each=nrow(t))
t.rt <- rep(t[,2], times=nrow(r))
r.mz <- rep(r[,2], each=nrow(t))
t.mz <- rep(t[,4], times=nrow(r))                       # 1.

ind <- which(in.interval(r.rt, t.rt, 0.02) & 
             in.interval(r.mz, t.mz, 0.0002))

can be replaced by

ind <- foreach(r.row = 1:nrow(r), .combine=cbind) %:% 
         foreach(t.row = 1:nrow(t)) %do% 
           match.criterion(r.row, t.row, r, t) %>% 
             as.logical(.) %>% which(.)

where the match.criterion is defined as

match.criterion <- function(r.row, t.row, r, t) {
  return(in.interval(r[r.row,1], t[t.row,2], 0.02) & 
         in.interval(r[r.row,2], t[t.row,4], 0.0002))
}

This is easier to parse and reflects what is being performed. Note that what is returned by the nested foreach combined with cbind is again a logical matrix. Finally, the application of the get.area.matched function over the list txt.import.matrix can also be performed using foreach:

res <- foreach(i = 1:length(txt.import.matrix), .combine=cbind) %do% 
         get.area.matched(reduced.list.pre.filtering, txt.import.matrix[[i]]) %>%
           cbind(reduced.list.pre.filtering,.)

The complete code using foreach is as follows:

require("magrittr")
require("foreach")

in.interval <- function(x, center, deviation, tol = .Machine$double.eps^0.5) {
  return (abs(x-center) <= (deviation + tol))
}

match.criterion <- function(r.row, t.row, r, t) {
  return(in.interval(r[r.row,1], t[t.row,2], 0.02) & 
     in.interval(r[r.row,2], t[t.row,4], 0.0002))
}

match.interval <- function(r, t) {
  ind <- foreach(r.row = 1:nrow(r), .combine=cbind) %:% 
       foreach(t.row = 1:nrow(t)) %do% 
     match.criterion(r.row, t.row, r, t) %>% 
       as.logical(.) %>% which(.)
  # which returns 1-D indices (row-major),
  # convert these to 2-D indices in (row,col)
  r.ind <- floor((ind - 1)/nrow(t)) + 1                   ## 2.
  # detect duplicates in r.ind and remove them from ind
  dup <- duplicated(r.ind)
  r.ind <- r.ind[!dup]
  t.ind <- ind[!dup] - (r.ind - 1)*nrow(t)                ## 3.

  return(cbind(r.ind,t.ind))                       
}

get.area.matched <- function(r, t) {
  match.ind <- match.interval(r, t)
  area <- rep(NA,nrow(r))
  area[match.ind[,1]] <- t[match.ind[,2], 3]
  return(area)
}

res <- foreach(i = 1:length(txt.import.matrix), .combine=cbind) %do% 
     get.area.matched(reduced.list.pre.filtering, txt.import.matrix[[i]]) %>%
       cbind(reduced.list.pre.filtering,.)

colnames(res) <- c(colnames(reduced.list.pre.filtering), 
           sapply(seq_len(length(txt.import.matrix)), 
              function(i) {return(paste0("Area.[",i,"]"))}))

Hope this helps.

Thank you Aichao ! Indeed your code is a little bit complicated for me but it is a good challenge to try to understand it. It's run very well, I get the output I want. Yes it's help a lot ! — Van3, Jul 18 '16 at 18:53
@Vanbell: Thanks. I'm still editing the post. Check back tomorrow and hopefully it will be clearer. Of course, if you have any questions, please leave a comment. Again, I thought I should post to point out the two possible aspects in your problem that I thought are important. — aichao, Jul 18 '16 at 19:09
Okay, I am waiting for your post. I thing, i will ask you some questions, because it is too easy to copy and past ;) — Van3, Jul 18 '16 at 19:32
I double check the results and I do not get what I expected. My data are little bit more complicated. Indeed, I do not have the same number of row for all the matrix which compose `txt.import.matrix`. I think I have to change this part of the code `sapply(seq_len(length(txt.import.matrix)), function(i) {return(paste0("Area[",i,"]"))}))` I tried `function(i) { sapply(seq_len(length(txt.import.matrix[i])), return(paste0("Area[",i,"]"))}))` but it is not working. I don't understand because it should take in consideration the length of each matrix of the list. — Van3, Jul 19 '16 at 03:00
@Vanbell: can you update your question with more data that illustrate the problem? Please use `dput()` so that I can reconstitute it quickly. In the meantime, I've updated the post so it may be more clear to you. I also intent to edit again to add another implementation using the `foreach` package. — aichao, Jul 19 '16 at 03:22
@Vanbell: also, can you make sure that the code you are running is the same as that is posted now. I was making minor tweaks that did not make a difference in the results with the original data you posted when you first made the comment that it worked for you. — aichao, Jul 19 '16 at 03:29
@Vanbell: Ok. I found a bug. I used `now(t)` instead of `now(r)` and conversely. It did not matter if the number of rows for `r` and `t` are the same. Looked at the latest update. — aichao, Jul 19 '16 at 03:57
@Vanbell: sorry about the confusion last night. It should work as you expect now. I've updated the post to reflect the new results, but code in the post from this morning should work (no changes there). Please let me know if you have any questions understanding. Thanks. — aichao, Jul 19 '16 at 15:00
Hello aichao, don't be sorry you help me a LOT ! I confirm you that the code work perfectly ! I already double check I can now make a lot of statistics. — Van3, Jul 19 '16 at 15:56
Alright thank you for the explanation. I have to continue to practice a lot to succeed to do something like that by myself. I will try the `foreach` package with my data and compare the results. I keep you informed. — Van3, Jul 20 '16 at 13:54
Hello, for whatever reasons, I did not succeed to apply the second version you propose. However, thanks to your nice description I succeeded to add an other criterion. Now I am trying to sum the duplicate instead of remove it. I keep you informed. — Van3, Jul 27 '16 at 19:42
@Vanbell can you tell me what happened with the `foreach` implementation? It seems to give me the same results. — aichao, Jul 28 '16 at 15:28
I retry today, I don't know why, but when I whant to use foreach package, R don't stop the calculation and I have no result. Is it because my matrix are big ? 3248 obs. of 6 variables ? — Van3, Aug 01 '16 at 02:03
@Vanbell: No, your data is not too big. I'll retest and edit the post today to try to make the explanation more clear. — aichao, Aug 01 '16 at 11:02
Thank you. I am just starting with R, that is why I am subjected to make mistake more easily. — Van3, Aug 01 '16 at 15:07
@Vanbell Sorry for the delay in getting back to you. I rechecked my code, and everything worked as before on the data you posted. I've updated the post with the complete code using `foreach`. Please verify on the original data that you posted, which should give the same result before testing on your full data. If it still does not work, please let me know. Thanks. — aichao, Aug 06 '16 at 02:48
@ aichao Hi, Sorry for my delay now. I made a little trip in the American health system. I try again the script. I think is more clear but I still have the followed error message: Error in eval(expr, envir, enclos) : could not find function "%do%". I found in http://stackoverflow.com/questions/27560450/what-do-dot-and-percentage-mean-in-r some explanation on the use of percentage. Is it because i forgot to defind %do% ? — Van3, Sep 14 '16 at 17:37
@Vanbell: I believe that error has to do with not loading the `library(foreach)` and `library(magrittr)` before issuing the commands. — aichao, Sep 14 '16 at 18:00
@ aichao, I try again but I still have the error. I am using Rstudio, and I can see the packages is checked. I will try after restart my r session. — Van3, Sep 14 '16 at 18:11

score 1 · Answer 3 · answered Jul 18 '16 at 00:52

This is a quick rough approach that might help, if I get what you're trying to do.

Unlist values from each variable of two matrices

areas <- unlist(lapply(txt.import.matrix, function(x) x$Area.))
rts <- unlist(lapply(txt.import.matrix, function(x) x$RT.))
mzs <- unlist(lapply(txt.import.matrix, function(x) x$m.z.))

Find indices of those values of RT and m.z. that are closest to value in third matrix/df:

 rtmins <- lapply(reduced.list.pre.filtering$RT., function(x) which(abs(rts-x)==min(abs(rts-x))))
mzmins <- lapply(reduced.list.pre.filtering$m.z., function(x) which(abs(mzs-x)==min(abs(mzs-x))))

Use purrr to quickly calculate which indices are in both (i.e. minimum difference for each):

inboth <- purrr::map2(rtmins,mzmins,intersect)

Get corresponding area value:

vals<-lapply(inboth, function(x) areas[x])

Use reshape2 to put into wide format:

vals2 <- reshape2::melt(vals)
vals2$number <- ave(vals2$L1, vals2$L1, FUN = seq_along)
vals.wide <-reshape2::dcast(vals2, L1 ~ number, value.var="value")

cbind(reduced.list.pre.filtering, vals.wide)

#   RT.     m.z. L1       1       2
#1 1.01 358.9777  1  2820.1  7820.1
#2 1.07 368.4238  2  8271.8      NA
#3 2.05 284.3295  3  6674.0 12674.0
#4 2.03 922.0092  4  5856.3      NA
#5 3.03 261.1299  5 27814.6      NA

This might give you some ideas. Could be easily adapted to check if shared minimum values exceed +/- a value.

Thank you, I keep you informed as soon as I succeed to apply it. — Van3, Jul 18 '16 at 02:03

match with an interval and extract values between two matrix R

3 Answers3

Linked