Joining Data Frames by Measured Values with an Error Range

Question

I'm looking for a way to join (or perhaps merge) two or more data frames in R containing measured values with a specified range of error. This means that the value in the "by" column would be nnn.nnnn +/- 0.000n. The error tolerance is limited to 3e-6 times the value.

This is my best attempt so far.

newDF <- left_join(P0511_480k, P0511_SF00V, by=c(P0511_480k$m.z == (P0511_SF00V$m.z - 0.000003(P0511_480k$m.z)) : (P0511_SF00V$m.z + 0.000003(P0511_480k$m.z))))

In this expression, I have two data frames (P0511_480k and P0511_SF00V) and I would like to merge them by a column named "m.z". The acceptable range of values is plus or minus "m.z" times 0.000003. For example, P0511_480k_subset$m.z = 187.06162 should match P0511_SF00V_subset$m.z = 187.06155.

> dput(head(P0511_480k_subset, 10))
structure(list(m.z = c(187.06162, 203.05652, 215.05668, 217.07224, 
279.05499), Intensity = c(319420.8, 288068.9, 229953, 210107.8, 
180054), Relative = c(100, 90.18, 71.99, 65.78, 56.37), Resolution = c(394956.59, 
415308.31, 387924.91, 437318.31, 410670.91), Baseline = c(2.1, 
1.43, 1.69, 1.73, 3.04), Noise = c(28.03, 27.17, 27.52, 27.58, 
29.37)), .Names = c("m.z", "Intensity", "Relative", "Resolution", 
"Baseline", "Noise"), class = c("tbl_df", "data.frame"), row.names = c(NA, 
-5L))

and

> dput(head(P0511_SF00V_subset, 10))
structure(list(m.z = c(187.06155, 203.05641, 215.05654, 217.0721
), Intensity = c(1021342.8, 801347.1, 662928.1, 523234.2), Relative = c(100, 
78.46, 64.91, 51.23), Resolution = c(314271.88, 298427.41, 289803.97, 
288163.63), Baseline = c(6.89, 10.47, 9.13, 8.89), Noise = c(40.94, 
45.98, 44.3, 44.01)), .Names = c("m.z", "Intensity", "Relative", 
"Resolution", "Baseline", "Noise"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -4L))

I appreciate your suggestions! I've searched through the help documentation as broadly as possible and I have not been able to find an example that is close to what I need.

Many thanks!

Please provide your data (or a subset of it) using `dput()` or `dput(head(df, 20))`. Also, you need to specify `*` when you do a multiplication (even if the number is before a parenthesis) — etienne, Nov 22 '16 at 22:40
Check out the [*fuzzyjoin* package](https://github.com/dgrtwo/fuzzyjoin), which is a variation of dplyr's join operations. — aosmith, Nov 22 '16 at 22:47
I think you need something like `data.table::foverlaps()`, provide data and expected output. — zx8754, Nov 22 '16 at 22:50
Possible duplicate of http://stackoverflow.com/questions/24480031 — zx8754, Nov 22 '16 at 22:52
Thanks etienne! with this "test2 <- left_join(P0511_480k, P0511_SF00V, by=c(P0511_480k$m.z == (P0511_SF00V$m.z - 0.000003*(P0511_480k$m.z)) : (P0511_SF00V$m.z + 0.000003*(P0511_480k$m.z))))" I get this: Error: cannot join on columns 'FALSE' x 'FALSE': index out of bounds In addition: Warning messages: 1: In P0511_SF00V$m.z - 3e-06 * (P0511_480k$m.z) : longer object length is not a multiple of shorter object length 2: In P0511_SF00V$m.z + 3e-06 * (P0511_480k$m.z) : longer object length is not a multiple of shorter object length .... — Lynn Mazzoleni, Nov 22 '16 at 23:22

code_is_entropy · Accepted Answer · 2016-11-23T03:40:08.683

If you don't need the non-matching rows, this can work. Call the two data sets df1 and df2. For each value in the m.z column of df1, find the nearest value in the m.z column of df2. If the two differ by less than 0.000003 times the df1 value, replace the df1 value with the matched df2 value; otherwise set it to 0 so that it matches nothing. Then merge the two data frames.

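tol <- 3e-6  # relative tolerance: 3e-6 times the df1 value

df1$m.z <- vapply(df1$m.z, function(x) {
  # Distance from x to every m.z value in df2, and the position of the nearest one
  d <- abs(df2$m.z - x)
  i <- which.min(d)
  # Within tolerance: take the df2 value so the keys match exactly.
  # Otherwise use 0, which matches nothing in df2 and is dropped by merge().
  # Indexing df2$m.z[i] rather than df2[i, "m.z"] keeps the result numeric when
  # df2 is a tibble, so no list column is produced.
  if (d[i] < tol * x) df2$m.z[i] else 0
}, numeric(1))

# Join on m.z only; the other shared column names get .x/.y suffixes
df3 <- merge(df1, df2, by = "m.z")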
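
If the non-matching rows should be kept instead, the same nearest-value idea can be turned into a left join. The following is only a sketch in base R, using the m.z and Intensity columns from the question's sample data; the helper name `nearest_within` and the `.SF00V` column suffix are made up for this illustration.

```r
# Subset of the question's sample data (m.z and Intensity only)
df1 <- data.frame(
  m.z = c(187.06162, 203.05652, 215.05668, 217.07224, 279.05499),
  Intensity = c(319420.8, 288068.9, 229953, 210107.8, 180054)
)
df2 <- data.frame(
  m.z = c(187.06155, 203.05641, 215.05654, 217.0721),
  Intensity = c(1021342.8, 801347.1, 662928.1, 523234.2)
)

# Hypothetical helper: for each value in x, return the index of the nearest
# value in y if it lies within rel_tol * x, otherwise NA
nearest_within <- function(x, y, rel_tol = 3e-6) {
  vapply(x, function(v) {
    d <- abs(y - v)
    i <- which.min(d)
    if (length(i) == 1L && d[i] <= rel_tol * v) i else NA_integer_
  }, integer(1))
}

idx <- nearest_within(df1$m.z, df2$m.z)

# Left join: every df1 row is kept; rows without a match get NA in the df2 columns
joined <- data.frame(
  df1,
  m.z.SF00V = df2$m.z[idx],
  Intensity.SF00V = df2$Intensity[idx]
)
```

With this data the first four rows of df1 pick up their df2 partners and the row at 279.05499 is kept with NA in the df2 columns. Note that each df1 value is paired with at most one df2 value (the nearest), which may not be what is wanted if peaks are closer together than the tolerance.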

edited Nov 23 '16 at 03:40

answered Nov 23 '16 at 01:16

Yes, but sapply converts the m.z values to a list. – Lynn Mazzoleni Nov 23 '16 at 16:19
So, I added "df1$m.z <- as.numeric(df1$m.z)". It appears to work, but this process is clunky considering that I have 5 data frames to merge. – Lynn Mazzoleni Nov 23 '16 at 16:43
I see. I agree it will be somewhat messy, but I would use a for loop or lapply for 5 data frames. – code_is_entropy Nov 23 '16 at 18:56
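
One way the lapply idea for several data frames might look is sketched below, assuming the first data frame serves as the reference for m.z and only rows present in every table are wanted. The helper `snap_to_reference`, the list names `a`, `b`, `c`, and the values in table `c` are invented for illustration; tables `a` and `b` reuse a few values from the question.

```r
# Hypothetical helper: replace each m.z in df with the nearest reference value
# when it lies within rel_tol times that reference value; drop rows that do not
snap_to_reference <- function(df, ref, rel_tol = 3e-6) {
  df$m.z <- vapply(df$m.z, function(v) {
    d <- abs(ref - v)
    i <- which.min(d)
    if (d[i] <= rel_tol * ref[i]) ref[i] else NA_real_
  }, numeric(1))
  df[!is.na(df$m.z), , drop = FALSE]
}

# Toy stand-ins for the measured tables (table c is made up)
frames <- list(
  a = data.frame(m.z = c(187.06162, 203.05652, 279.05499),
                 Intensity = c(319420.8, 288068.9, 180054)),
  b = data.frame(m.z = c(187.06155, 203.05641),
                 Intensity = c(1021342.8, 801347.1)),
  c = data.frame(m.z = c(187.06170, 215.05654),
                 Intensity = c(500, 600))
)

# Snap every table except the first onto the first table's m.z values
ref <- frames[[1]]$m.z
snapped <- lapply(frames[-1], snap_to_reference, ref = ref)

# Tag the non-key columns with the table name so they stay distinguishable
tagged <- Map(function(df, nm) {
  keep <- names(df) != "m.z"
  names(df)[keep] <- paste0(names(df)[keep], ".", nm)
  df
}, c(frames[1], snapped), names(frames))

# Fold the pairwise merges into one table keyed on m.z
merged <- Reduce(function(x, y) merge(x, y, by = "m.z"), tagged)
```

Because every snapped m.z is copied from the reference vector, the exact-equality matching inside `merge()` is safe here. Passing `all = TRUE` to `merge()` would keep rows that are missing from some tables.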
