
I'm relatively new to R and I have a question about merging two data frames that contain similar, but not identical, numeric data from two domains (mz and rt). Here is an example that describes my problem:

mz1    <- c(seq(100, 190, by = 10))
rt1    <- c(seq(1, 10, by = 1))
value1 <- runif(10, min = 100, max = 100000)
mz2    <- mz1 + runif(10, -0.1, 0.1)
rt2    <- rt1 + runif(10, -0.2, 0.2)
value2 <- runif(10, min = 100, max = 100000)

df1 <- as.data.frame(cbind(mz1, rt1, value1))
df2 <- as.data.frame(cbind(mz2, rt2, value2))


df1
   mz1 rt1    value1
1  100   1 44605.646
2  110   2 13924.598
3  120   3 35727.265
4  130   4 75175.652
5  140   5 25221.724
6  150   6 29080.653
7  160   7  3170.749
8  170   8 10184.708
9  180   9 48055.072
10 190  10 77644.865


df2
        mz2      rt2   value2
1  100.0243 1.043092 58099.49
2  110.0514 2.164753 76397.67
3  120.0258 2.838141 43901.05
4  130.0921 4.044322 34543.96
5  139.9577 5.023823 53086.10
6  150.0170 6.061794 13929.27
7  160.0884 6.828779 60905.61
8  170.0440 7.932000 66627.20
9  180.0872 9.116425 44587.62
10 189.9694 9.834091 51186.03

I want to merge all rows from df1 and df2 that differ by at most 0.1 in the rt domain and at most 0.05 in the mz domain. In addition, if two or more rows fulfill these criteria, the row with the smallest combined distance in both domains should be merged (this probably requires an additional calculation, e.g. distance = sqrt(mz_diff^2 + rt_diff^2)), and the remaining rows have to find a different merging partner, if one exists. If there is no merging partner, keep the row and fill the missing values with NA.
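
To make the criterion concrete, the check for a single candidate pair could look like this (a minimal sketch; is_match() is just an illustrative helper, not existing code):

# sketch of the matching rule for one candidate pair; is_match() is hypothetical
is_match <- function(mz_a, rt_a, mz_b, rt_b) {
  mz_diff <- abs(mz_a - mz_b)
  rt_diff <- abs(rt_a - rt_b)
  list(ok   = mz_diff <= 0.05 & rt_diff <= 0.1,
       dist = sqrt(mz_diff^2 + rt_diff^2))  # tie-breaker when several rows match
}
is_match(df1$mz1[1], df1$rt1[1], df2$mz2[1], df2$rt2[1])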

What I have tried so far:

merge.data.frame(df1, df2, by.x = c("mz1", "rt1"), by.y = c("mz2", "rt2") , all = T)

        mz1 rt1    value1      rt2   value2
1  100.0000   1 44605.646       NA       NA
2  100.0243  NA        NA 1.043092 58099.49
3  110.0000   2 13924.598       NA       NA
4  110.0514  NA        NA 2.164753 76397.67
5  120.0000   3 35727.265       NA       NA
6  120.0258  NA        NA 2.838141 43901.05
7  130.0000   4 75175.652       NA       NA
8  130.0921  NA        NA 4.044322 34543.96
9  139.9577  NA        NA 5.023823 53086.10
10 140.0000   5 25221.724       NA       NA
11 150.0000   6 29080.653       NA       NA
12 150.0170  NA        NA 6.061794 13929.27
13 160.0000   7  3170.749       NA       NA
14 160.0884  NA        NA 6.828779 60905.61
15 170.0000   8 10184.708       NA       NA
16 170.0440  NA        NA 7.932000 66627.20
17 180.0000   9 48055.072       NA       NA
18 180.0872  NA        NA 9.116425 44587.62
19 189.9694  NA        NA 9.834091 51186.03
20 190.0000  10 77644.865       NA       NA

This at least gives me a data frame in the right format, which contains NAs where no merge was possible.

It would be awesome if someone could help me with this problem!

Greetings

Update

Alright, I will keep that in mind. Thank you so far. I have tried the following as an idea:

# select rows in joined that found no partner (anti_join() is from dplyr)
no_match_df1 <- anti_join(joined, df2)
no_match_df1 <- no_match_df1[1:3]

# select rows in df2 that were excluded due to duplication
collist <- c("mz2", "rt2", "value2")
duplicates <- joined[complete.cases(joined[collist]), collist]
duplicates <- anti_join(df2, duplicates)


# repeat the join with the leftover rows
joined2 <- fuzzy_join(no_match_df1, duplicates, multi_by = c("mz1" = "mz2", "rt1" = "rt2"),
                      multi_match_fun = mmf, mode = "full")

joined2 <- group_by(joined2, mz1, rt1) %>%
  mutate(min_dist = min(dist))
head(joined2)

joined2 <- filter(joined2, dist == min_dist | is.na(dist)) %>%
  select(-dist, -min_dist)
head(joined2)

# select only rows with a new match or where a duplicate couldn't find a
# partner; in both cases mz2 is not NA

add <- subset(joined2, !is.na(mz2))

#add to joined
##I need some help here, how can I update the existing joined data frame?

Maybe this helps

Maybe we can join no_match_df1 with the duplicates as we have done before and just add the results by overwriting the corresponding rows in the existing joined data frame. Finally, we would have to repeat that process as long as there are duplicates left (see the sketch below).
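
A rough, untested sketch of that idea. It assumes mmf() from the answer below, that double matches were dropped from joined (filtered out, not blanked, so anti_join() can recover them), and that the mz1 values are unique enough to serve as a key:

library(dplyr)
library(fuzzyjoin)

repeat {
  # rows of df1 that are still without a partner
  no_match_df1 <- anti_join(joined, df2)[, c("mz1", "rt1", "value1")]
  # rows of df2 that were excluded as duplicates
  matched_df2  <- joined[complete.cases(joined[c("mz2", "rt2", "value2")]),
                         c("mz2", "rt2", "value2")]
  duplicates   <- anti_join(df2, matched_df2)
  if (nrow(no_match_df1) == 0 || nrow(duplicates) == 0) break

  # try to match the leftovers against each other
  rematch <- fuzzy_join(no_match_df1, duplicates,
                        multi_by = c("mz1" = "mz2", "rt1" = "rt2"),
                        multi_match_fun = mmf, mode = "inner") %>%
    group_by(mz1, rt1) %>%
    filter(dist == min(dist)) %>%   # keep only the closest new partner
    ungroup()
  if (nrow(rematch) == 0) break

  # overwrite the df2 part of the previously unmatched df1 rows
  idx <- match(rematch$mz1, joined$mz1)
  joined[idx, c("mz2", "rt2", "value2")] <- rematch[, c("mz2", "rt2", "value2")]
}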

Marco
    You might try looking at the package **fuzzyjoin**. – joran Jan 04 '17 at 20:33
  • i too have come across such cases from my colleagues and they used to tease that its way easy to implement such merge on conditions in SAS. – joel.wilson Jan 05 '17 at 07:14
  • @Marco, If you need to update your question because an answer helped you somehow [edit] your own question, not the answer which helped you. – llrs Jan 05 '17 at 13:25
  • Ok, sry my mistake. I'm pretty new here and that was actually a wrong click. – Marco Jan 05 '17 at 13:46

1 Answer


Following the advice by joran, I found a solution using the fuzzyjoin package. I created the data sets as follows:

set.seed(123)
mz1    <- c(seq(100, 190, by = 10))
rt1    <- c(seq(1, 10, by = 1))
value1 <- runif(10, min = 100, max = 100000)
mz2    <- mz1 + runif(10, -0.1, 0.1)
rt2    <- rt1 + runif(10, -0.2, 0.2)
value2 <- runif(10, min = 100, max = 100000)

df1 <- as.data.frame(cbind(mz1, rt1, value1))
df2 <- as.data.frame(cbind(mz2, rt2, value2))

(A little side remark: you made an excellent reproducible example. The only weakness is that you did not set a seed, which is the only difference between the code above and yours.)

To make sure that there is a case where two matches are found, I added an additional row to df2:

df2 <- rbind(df2, c(180.001, 9.09, 0))

Now I can use the function fuzzy_join() to merge the data frames (the matching function mmf is defined below):

library(fuzzyjoin)
library(dplyr)  # needed for data_frame() inside mmf()
joined <- fuzzy_join(df1, df2, multi_by = c("mz1" = "mz2", "rt1" = "rt2"),
                     multi_match_fun = mmf, mode = "full")

Note that the syntax is very similar to the join functions from dplyr. There is a crucial difference, however: with multi_match_fun you can provide a function that decides whether two rows match. This function must return a data frame whose first column is logical; that column determines whether two rows match, and all other columns are simply added to the resulting data frame. I defined the function as follows:

mmf <- function(x, y) {
  # x and y are matrices holding the mz and rt values of the candidate row pairs
  mz_dist <- abs(x[, 1] - y[, 1])
  rt_dist <- abs(x[, 2] - y[, 2])

  # first column: do the rows match? second column: distance used for tie-breaking
  out <- data_frame(merge = rt_dist <= 0.1 & mz_dist <= 0.05,
                    dist = sqrt(mz_dist^2 + rt_dist^2))
  return(out)
}

You can see that the column merge (the name is arbitrary) is TRUE exactly when the conditions you specified are satisfied. In addition, a column containing the distance is added for later use. I set mode = "full" in order to get NA values wherever there is no match.
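
To see this contract in isolation, you can call mmf() directly (fuzzy_join() hands the by columns to it in essentially this matrix form; the numbers here are made up for illustration):

# a match: mz differs by 0.02 (<= 0.05) and rt by 0.05 (<= 0.1)
mmf(cbind(100, 1), cbind(100.02, 1.05))
## merge = TRUE, dist = sqrt(0.02^2 + 0.05^2), roughly 0.054

# not a match: the mz difference of 0.06 exceeds 0.05
mmf(cbind(100, 1), cbind(100.06, 1.05))
## merge = FALSE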

The result looks as follows:

head(joined)
##   mz1 rt1   value1      mz2      rt2   value2       dist
## 1 110   2 78851.68 109.9907 2.077121 90239.67 0.07768406
## 2 120   3 40956.79 120.0355 3.056203 69101.46 0.06648308
## 3 180   9 55188.36 179.9656 8.915664 31886.28 0.09108803
## 4 180   9 55188.36 180.0010 9.090000     0.00 0.09000556
## 5 100   1 28828.99       NA       NA       NA         NA
## 6 130   4 88313.44       NA       NA       NA         NA

In rows 3 and 4 you can see that there were indeed two matches in this case, and from the column dist you can see that row 4 is the one we want to keep. This means that row 3 should be treated as not having found a match, and the columns mz1, rt1, and value1 should be filled with NA. I did this by grouping the rows by mz1 and rt1 and then adding the minimum distance for each group:

library(dplyr)
joined <- group_by(joined, mz1, rt1) %>%
            mutate(min_dist = min(dist))
head(joined)
## Source: local data frame [6 x 8]
## Groups: mz1, rt1 [5]
## 
##     mz1   rt1   value1      mz2      rt2   value2       dist   min_dist
##   <dbl> <dbl>    <dbl>    <dbl>    <dbl>    <dbl>      <dbl>      <dbl>
## 1   110     2 78851.68 109.9907 2.077121 90239.67 0.07768406 0.07768406
## 2   120     3 40956.79 120.0355 3.056203 69101.46 0.06648308 0.06648308
## 3   180     9 55188.36 179.9656 8.915664 31886.28 0.09108803 0.09000556
## 4   180     9 55188.36 180.0010 9.090000     0.00 0.09000556 0.09000556
## 5   100     1 28828.99       NA       NA       NA         NA         NA
## 6   130     4 88313.44       NA       NA       NA         NA         NA

The rows with valid matches are all those where dist is equal to min_dist. In addition, we should also not lose the rows where dist is NA. This can be done as follows:

# rows where a closer match exists; which() drops the NA comparisons
dbls <- which(joined$dist != joined$min_dist)
# blank out the df1 part of those rows and drop the helper columns
joined[dbls, c("mz1", "rt1", "value1")] <- NA
joined <- select(joined, -dist, -min_dist)
head(joined)
## Source: local data frame [6 x 6]
## Groups: mz1, rt1 [6]
## 
##     mz1   rt1   value1      mz2      rt2   value2
##   <dbl> <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
## 1   110     2 78851.68 109.9907 2.077121 90239.67
## 2   120     3 40956.79 120.0355 3.056203 69101.46
## 3    NA    NA       NA 179.9656 8.915664 31886.28
## 4   180     9 55188.36 180.0010 9.090000     0.00
## 5   100     1 28828.99       NA       NA       NA
## 6   130     4 88313.44       NA       NA       NA

Depending on how your data looks, it is also possible that in the case of a double match the values of mz1 and rt1 do not agree, but another pair of values does. You would then have to repeat the above step with other groupings as well, as sketched below.
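
For example, if one row of df2 had matched several rows of df1, the symmetric clean-up could look like this (a sketch only; it assumes you run it while the dist column is still present, i.e. before the select() above):

# group by the df2 key instead and keep only the closest df1 partner
joined <- group_by(joined, mz2, rt2) %>%
  mutate(min_dist2 = min(dist)) %>%
  ungroup()
dbls2 <- which(joined$dist != joined$min_dist2)
joined[dbls2, c("mz2", "rt2", "value2")] <- NA
joined <- select(joined, -min_dist2)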

Stibu
  • Wow, I'm really happy about all the help I received from you guys! – Marco Jan 05 '17 at 10:32
  • That is pretty much what I want. There is only one thing missing: only row 4 from df2 was retained. In the first place that is the correct decision, but I don't want to lose the information from row 3 in df2. Something like: take the data from df2 which found a partner in the first place but was rejected due to double matching and too large a distance. This data should have a second chance to find a partner in df1 (one that does not already have a partner) with the same criteria. Repeat as long as there are doubles. – Marco Jan 05 '17 at 10:44
  • @Marco You are right: you want `NA` values for lines that did not match and therefore, I should not simply throw away the doubles. I will update my answer to take care of this a little later, since I don't have time now. Note, however, that while it is easy enough to do in the simple example, things might get much more complicated in a more complex situation. – Stibu Jan 05 '17 at 11:39
  • Thanks, I have updated my post with an idea. I'm not sure how everything works here in the forum and if it is already visible for you. – Marco Jan 05 '17 at 13:02
  • @Marco I have updated my answer as promised. In your edit you propose a different approach to solve the problem. While this approach may lead to a good solution, your edit unfortunately changes the question: instead of asking how to do a merge with certain conditions, you now ask how you should go on after you have solved parts of the problem. It is not a good idea to change the question after you have received answers, because it invalidates the answers that you already got. If you have a new question, then you should ask a new question. – Stibu Jan 05 '17 at 20:15
  • Sry that was of course not my intention. The update had the only purpose to make my problem clearer. Thank you for your help so far, I like your solution. The only problem at the moment is, I would like to repeat the matching step with all the doubled rows rather than just add them and fill the corresponding df1 columns with NA's. However, as you mentioned, I should open a new question for this purpose. I will accept your answer, thanks again! You have been such a great help! – Marco Jan 06 '17 at 09:44
  • You are welcome! I don't understand your remark regarding repeating the matching step with the doubled rows. If there are other matches for that doubled row, then they will have been found in the first attempt already, so there is no need to search for these again. But maybe there is a misunderstanding on my side... – Stibu Jan 06 '17 at 09:49