I'm relatively new in R and I have a question about merging two data frames, which does contain similar numeric data from two domains (mz and rt) but not the same. Here an example which describes my problem:
mz1 <- c(seq(100, 190, by = 10))
rt1 <- c(seq(1, 10, by = 1))
value1 <- runif(10, min = 100, max = 100000)
mz2 <- mz1 + runif(10, -0.1, 0.1)
rt2 <- rt1 + runif(10, -0.2, 0.2)
value2 <- runif(10, min = 100, max = 100000)
df1 <- as.data.frame(cbind(mz1, rt1, value1))
df2 <- as.data.frame(cbind(mz2, rt2, value2))
df1
mz1 rt1 value1
1 100 1 44605.646
2 110 2 13924.598
3 120 3 35727.265
4 130 4 75175.652
5 140 5 25221.724
6 150 6 29080.653
7 160 7 3170.749
8 170 8 10184.708
9 180 9 48055.072
10 190 10 77644.865
df2
mz2 rt2 value2
1 100.0243 1.043092 58099.49
2 110.0514 2.164753 76397.67
3 120.0258 2.838141 43901.05
4 130.0921 4.044322 34543.96
5 139.9577 5.023823 53086.10
6 150.0170 6.061794 13929.27
7 160.0884 6.828779 60905.61
8 170.0440 7.932000 66627.20
9 180.0872 9.116425 44587.62
10 189.9694 9.834091 51186.03
I want to merge all rows from df1 and df2 which have a difference <= 0.1 in the rt domain and a difference <= 0.05 in the mz domain. In addition, if there are two or more rows which fulfill this criteria the row with the smallest distance to both domains should be merged (maybe an additional calculation is necessary: distance = sqrt(mz^2+rt^2)) and the remaining rows have to find a different merging partner if existing. If there is no merging partner keep the row and fill "NA" to the missing value.
What I have tried so far:
merge.data.frame(df1, df2, by.x = c("mz1", "rt1"), by.y = c("mz2", "rt2") , all = T)
mz1 rt1 value1 rt2 value2
1 100.0000 1 44605.646 NA NA
2 100.0243 NA NA 1.043092 58099.49
3 110.0000 2 13924.598 NA NA
4 110.0514 NA NA 2.164753 76397.67
5 120.0000 3 35727.265 NA NA
6 120.0258 NA NA 2.838141 43901.05
7 130.0000 4 75175.652 NA NA
8 130.0921 NA NA 4.044322 34543.96
9 139.9577 NA NA 5.023823 53086.10
10 140.0000 5 25221.724 NA NA
11 150.0000 6 29080.653 NA NA
12 150.0170 NA NA 6.061794 13929.27
13 160.0000 7 3170.749 NA NA
14 160.0884 NA NA 6.828779 60905.61
15 170.0000 8 10184.708 NA NA
16 170.0440 NA NA 7.932000 66627.20
17 180.0000 9 48055.072 NA NA
18 180.0872 NA NA 9.116425 44587.62
19 189.9694 NA NA 9.834091 51186.03
20 190.0000 10 77644.865 NA NA
This gives me at least a data frame in the right format, which contains NA's where no merging was possible.
It would be awesome if someone could me help with this problem!
Greetings
Update
Alright, I will keep that in mind. Thank you so far. I have tried the following as an idea:
#select data in joined which has no partner
no_match_df1 <- anti_join(joined, df2)
no_match_df1 <- no_match_df1[1:3]
#select data in df2 which has been excluded due to duplication
collist <- c("mz2", "rt2", "value2")
dublicates <- joined[complete.cases(joined[collist]), collist]
dublicates <- anti_join(df2, dublicates)
#repetition for joining
joined2 <- fuzzy_join(no_match_df1, dublicates, multi_by = c("mz1" = "mz2", "rt1" = "rt2"),
multi_match_fun = mmf, mode = "full")
joined2 <- group_by(joined2, mz1, rt1) %>%
mutate(min_dist = min(dist))
head(joined2)
joined2 <- filter(joined2, dist == min_dist | is.na(dist)) %>%
select(-dist, -min_dist)
head(joined2)
#select only rows with new match or where dublicates coulnd't find a partner
add <- subset(joined2, !is.na(joined2$mz2) | !is.na(joined2$mz2) & !is.na(joined2$mz1))
#add to joined
##I need some help here, how can I update the existing joined data frame?
Maybe we can join the no_match_df1
with the duplicates
as we have done it before and just add the results by overwriting the particularly rows in the existing joined
data frame.
Finally, we have to repeat that process as log as the length of duplicates
is >1.