3

I have two matrix (A and B). I am trying to subset the matching rows from B with an interval value. For example,

Matrix A contains (I have more than 200 compounds)

Name   Mass.    RT.   Area.  ID

Asa.   234.032  1.56. 6755. Sd323

bda    164.041. 4.48. 5353. SD424

dsf.   353.953. 6.53. 2535. SD422

fed.   535.535. 5.14. 4542  SD424

Matrix B contains (similarly original matrix or CSV contains 5000 compounds)

Name. mass.      RT     Area. chemID pubID score

csa.  234.031    1.56.  4354.  frsg.  gss.  90

bda.  164.041.   4.78.  4346.  gsdg   gsf.  80

dwf.  432.035.   9.84.  4245.  grhr.  hfg.  99

fsf.  535.042.   7.01.  5353.  heth.  gww.  90

Now I want to subset the matching compounds from matrix B using Mass ± 0.001 and RT ± 0.5 interval and final matrix look like

Name. mass.      RT     Area. chemID pubID score

csa.  234.031    1.56.  4354.  frsg.  gss.  90

bda.  164.041.   4.78.  4346.  gsdg   gsf.  80

I tried with following commands in R and didnt work well. Any help is really appreciated.

#Read in first table
fname = "A.csv"
df1 = read.csv(fname)
# Read in the second table
fname = "B.xlsx"
df2 = read_excel(fname, skip=4)
# Create an empy dataframe
new_df = setNames(data.frame(matrix(ncol = ncol(df2), nrow = 0)), colnames(df2))
# Set the threshold for the mass and the retention time
m_ths = 1.e-3  # Mass threshold
rt_ths = 0.5  # Retention time threshold
# Loop over the indices of one of the data frames
for (i in 1:nrow(df1)) {
    # Get the mass and retention time of the current row
    m = df1$Mass[i]
    rt = df1$RT[i]
    # Get boolean vectors of rows within the second table that are within the 
    # given tolerance of the current mass (m) and retention time (rt)
    m_cond = df2$Mass >= m-m_ths & df2$Mass <= m+m_ths
    rt_cond = df2$RT >= rt-rt_ths & df2$RT <= rt + rt_ths
    # Get the subset of rows in second table that meet the required conditions
    tmp_df = subset(df2, m_cond & rt_cond)
    if (nrow(tmp_df) > 0) {
        # If the new table is not empty add it to the empty new_df data frame
        tmp_df$mb_data_index = i
        new_df = rbind(new_df, tmp_df)
    }
}
write.csv(new_df, "commoncompounds.csv")

1 Answers1

0

Code:

library('data.table')
# join two data tables and get only the matching rows by Name
df3 <- setDT(df2)[df1, on = 'Name', nomatch = 0]
# subset based on conditions of Mass and RT
df3 <- df3[ (round(abs(Mass - i.Mass), 3) <= 0.001) & 
            (round(abs(RT - i.RT), 1) <= 0.5), ]
# remove columns of df1
df3[, `:=` (i.Mass = NULL, i.RT = NULL, i.Area = NULL, ID = NULL)]
df3
#    Name    Mass   RT Area chemID pubID score
# 1:  Asa 234.031 1.56 4354   frsg   gss    90
# 2:  bda 164.041 4.78 4346   gsdg   gsf    80

Data:

df1 <- read.table(text = 
'Name   Mass    RT   Area  ID
Asa   234.032 1.56 6755 Sd323
bda   164.041 4.48 5353 SD424
dsf   353.953 6.53 2535 SD422
fed   535.535 5.14 4542 SD424', header = TRUE, stringsAsFactors = FALSE)

df2 <- read.table(text = 'Name Mass      RT     Area chemID pubID score
                  Asa  234.031    1.56  4354  frsg  gss  90
                  bda  164.041   4.78  4346  gsdg   gsf  80
                  dwf  432.035   9.84  4245  grhr  hfg  99
                  fsf  535.042  7.01  5353  heth  gww  90', header = TRUE, stringsAsFactors = FALSE)
Sathish
  • 12,453
  • 3
  • 41
  • 59
  • some of the compounds werent properly annotated, to get more accurate list, I need to subset based on Mz and RT with interval – Saravanan Devendran Mar 30 '20 at 19:32
  • can you name those compounds? `abs` will convert negative to positive values, so it will give the interval condition. – Sathish Mar 30 '20 at 19:34
  • I don't see Mz column in your data. Do you mean Mass – Sathish Mar 30 '20 at 19:35
  • Try this and see the output: `abs(-0.5); abs(0.5)` – Sathish Mar 30 '20 at 19:36
  • "not properly annotated" - Do you mean those dots next to each value in your data? The data I posted are nice and clean unlike the ones in your question. – Sathish Mar 30 '20 at 19:45
  • This is just an example matrix and not real compound names. real matrix has more than 5000 compounds to compare – Saravanan Devendran Mar 30 '20 at 20:43
  • I am thinking, maybe somethink like this -https://stackoverflow.com/questions/38426821/match-with-an-interval-and-extract-values-between-two-matrix-r# will work. But I am not an expert in R. maybe, you can help me. – Saravanan Devendran Mar 30 '20 at 20:47
  • You want to test for condition of RT between -0.5 to +0.5. Take this as example. What happens when you do `abs(-0.5)`? The `-0.5` will change to `+0.5`. Of course positive values never change their signs when used with `abs()`. What does it tell? We can now test for interval -0.5 to +0.5 by just checking the positive values. This is what happens in the code. I am writing this so you understand it, I guess. – Sathish Mar 30 '20 at 21:08
  • Try this: `df3[ , round(abs(Mass - i.Mass), 3) ]; df3[ , round(abs(RT - i.RT), 1) ]` – Sathish Mar 30 '20 at 21:10
  • Mass values of Asa: `(234.032 - 234.031); (234.031 - 234.032);` `abs(234.032 - 234.031); abs(234.031 - 234.032);` – Sathish Mar 30 '20 at 21:21
  • Thanks Sathish, But again, this is based on matching the compound name. is it possible to directly match the mass, RT with an interval? and then subset them from matrix 2?. – Saravanan Devendran Mar 31 '20 at 07:05
  • I disagree with you on "this is based on matching the compound name". First we match Names in two data frames, extract all matching rows and then we apply the condition. – Sathish Apr 01 '20 at 08:13
  • If you want to apply only those conditions for each data frame, then you need a threshold to compare against. May be, I am missing something here. Please explain more so I understand your need. Write it step by step what you want to accomplish. Please edit your question. If you have sufficient information in the question already, then wait for somebody to help you out. I did my best here. – Sathish Apr 01 '20 at 08:15