I have two matrix (A and B). I am trying to subset the matching rows from B with an interval value. For example,
Matrix A contains (I have more than 200 compounds)
Name Mass. RT. Area. ID
Asa. 234.032 1.56. 6755. Sd323
bda 164.041. 4.48. 5353. SD424
dsf. 353.953. 6.53. 2535. SD422
fed. 535.535. 5.14. 4542 SD424
Matrix B contains (similarly original matrix or CSV contains 5000 compounds)
Name. mass. RT Area. chemID pubID score
csa. 234.031 1.56. 4354. frsg. gss. 90
bda. 164.041. 4.78. 4346. gsdg gsf. 80
dwf. 432.035. 9.84. 4245. grhr. hfg. 99
fsf. 535.042. 7.01. 5353. heth. gww. 90
Now I want to subset the matching compounds from matrix B using Mass ± 0.001 and RT ± 0.5 interval and final matrix look like
Name. mass. RT Area. chemID pubID score
csa. 234.031 1.56. 4354. frsg. gss. 90
bda. 164.041. 4.78. 4346. gsdg gsf. 80
I tried with following commands in R and didnt work well. Any help is really appreciated.
#Read in first table
fname = "A.csv"
df1 = read.csv(fname)
# Read in the second table
fname = "B.xlsx"
df2 = read_excel(fname, skip=4)
# Create an empy dataframe
new_df = setNames(data.frame(matrix(ncol = ncol(df2), nrow = 0)), colnames(df2))
# Set the threshold for the mass and the retention time
m_ths = 1.e-3 # Mass threshold
rt_ths = 0.5 # Retention time threshold
# Loop over the indices of one of the data frames
for (i in 1:nrow(df1)) {
# Get the mass and retention time of the current row
m = df1$Mass[i]
rt = df1$RT[i]
# Get boolean vectors of rows within the second table that are within the
# given tolerance of the current mass (m) and retention time (rt)
m_cond = df2$Mass >= m-m_ths & df2$Mass <= m+m_ths
rt_cond = df2$RT >= rt-rt_ths & df2$RT <= rt + rt_ths
# Get the subset of rows in second table that meet the required conditions
tmp_df = subset(df2, m_cond & rt_cond)
if (nrow(tmp_df) > 0) {
# If the new table is not empty add it to the empty new_df data frame
tmp_df$mb_data_index = i
new_df = rbind(new_df, tmp_df)
}
}
write.csv(new_df, "commoncompounds.csv")