I'm looking for a way to join (or perhaps merge) two or more data frames in R that contain measured values, matching within a specified range of error. This means that the value in the "by" column would be nnn.nnnn +/- 0.000n. The error tolerance is limited to 3e-6 times the value.

This is my best attempt so far.

newDF <- left_join(P0511_480k, P0511_SF00V, by=c(P0511_480k$m.z == (P0511_SF00V$m.z - 0.000003(P0511_480k$m.z)) : (P0511_SF00V$m.z + 0.000003(P0511_480k$m.z))))

In this expression, I have two data frames (P0511_480k and P0511_SF00V) and I would like to merge them by a column named "m.z". The acceptable range of values is plus or minus "m.z" times 0.000003. For example, P0511_480k_subset$m.z = 187.06162 should match P0511_SF00V_subset$m.z = 187.06155.

> dput(head(P0511_480k_subset, 10))
structure(list(m.z = c(187.06162, 203.05652, 215.05668, 217.07224, 
279.05499), Intensity = c(319420.8, 288068.9, 229953, 210107.8, 
180054), Relative = c(100, 90.18, 71.99, 65.78, 56.37), Resolution = c(394956.59, 
415308.31, 387924.91, 437318.31, 410670.91), Baseline = c(2.1, 
1.43, 1.69, 1.73, 3.04), Noise = c(28.03, 27.17, 27.52, 27.58, 
29.37)), .Names = c("m.z", "Intensity", "Relative", "Resolution", 
"Baseline", "Noise"), class = c("tbl_df", "data.frame"), row.names = c(NA, 
-5L))

and

> dput(head(P0511_SF00V_subset, 10))
structure(list(m.z = c(187.06155, 203.05641, 215.05654, 217.0721
), Intensity = c(1021342.8, 801347.1, 662928.1, 523234.2), Relative = c(100, 
78.46, 64.91, 51.23), Resolution = c(314271.88, 298427.41, 289803.97, 
288163.63), Baseline = c(6.89, 10.47, 9.13, 8.89), Noise = c(40.94, 
45.98, 44.3, 44.01)), .Names = c("m.z", "Intensity", "Relative", 
"Resolution", "Baseline", "Noise"), class = c("tbl_df", "data.frame"
), row.names = c(NA, -4L))

I appreciate your suggestions! I've searched through the help documentation as broadly as possible and I have not been able to find an example that is close to what I need.

Many thanks!

Ivo
  • Please provide your data (or a subset of it) using `dput()` or `dput(head(df, 20))`. Also, you need to specify `*` when you do a multiplication (even if the number is before a parenthesis) – etienne Nov 22 '16 at 22:40
  • Check out the [*fuzzyjoin* package](https://github.com/dgrtwo/fuzzyjoin), which is a variation of dplyr's join operations. – aosmith Nov 22 '16 at 22:47
  • I think you need something like `data.table::foverlaps()`, provide data and expected output. – zx8754 Nov 22 '16 at 22:50
  • Possible duplicate of http://stackoverflow.com/questions/24480031 – zx8754 Nov 22 '16 at 22:52
  • Thanks etienne! With `test2 <- left_join(P0511_480k, P0511_SF00V, by=c(P0511_480k$m.z == (P0511_SF00V$m.z - 0.000003*(P0511_480k$m.z)) : (P0511_SF00V$m.z + 0.000003*(P0511_480k$m.z))))` I get this: `Error: cannot join on columns 'FALSE' x 'FALSE': index out of bounds`. In addition: Warning messages: 1: In `P0511_SF00V$m.z - 3e-06 * (P0511_480k$m.z)` : longer object length is not a multiple of shorter object length 2: In `P0511_SF00V$m.z + 3e-06 * (P0511_480k$m.z)` : longer object length is not a multiple of shorter object length .... – Lynn Mazzoleni Nov 22 '16 at 23:22

1 Answer

If you don't need the non-matching rows, this can work. Assume the two data sets are df1 and df2. Go through the m.z column of df1; if a value is within the relative tolerance (0.000003 times the value) of any value in the m.z column of df2, replace it with the nearest matching value from df2. Then merge the two data frames on m.z.

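tol <- 0.000003  # relative tolerance: 3e-6 times the value
df1$m.z <- sapply(df1$m.z, function(x) {
  # Distance from x to every m.z in df2, and the position of the nearest one
  d <- abs(df2$m.z - x)
  i <- which.min(d)
  # If the nearest value is within tolerance, use it as the key; otherwise NA
  # (df2$m.z[i] returns a plain number even when df2 is a tibble)
  if (length(i) == 1 && d[i] <= tol * x) df2$m.z[i] else NA_real_
})
# Merge on m.z only; without `by`, merge() would also try to match Intensity,
# Relative, etc., because both data frames share those column names
df3 <- merge(df1, df2, by = "m.z", suffixes = c(".df1", ".df2"))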
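
If you do need to keep the non-matching rows (a left join), one possible base R sketch is below. It is not taken from the question or from any package; it assumes each row of df1 should pair with at most its single nearest row of df2, and the small df1/df2 here are hypothetical stand-ins built from the m.z and Intensity values in the question. The ".y" suffix and the names `tol`, `idx` and `joined` are illustrative choices.

```r
# Hypothetical stand-ins for the two data frames (values from the question)
df1 <- data.frame(m.z = c(187.06162, 203.05652, 215.05668, 217.07224, 279.05499),
                  Intensity = c(319420.8, 288068.9, 229953, 210107.8, 180054))
df2 <- data.frame(m.z = c(187.06155, 203.05641, 215.05654, 217.0721),
                  Intensity = c(1021342.8, 801347.1, 662928.1, 523234.2))

tol <- 3e-6  # relative tolerance, as stated in the question

# diffs[i, j] = |df1$m.z[i] - df2$m.z[j]|
diffs <- abs(outer(df1$m.z, df2$m.z, "-"))

# For each df1 row: position of the nearest df2 row, or NA if out of tolerance
nearest <- apply(diffs, 1, which.min)
within  <- diffs[cbind(seq_along(nearest), nearest)] <= tol * df1$m.z
idx     <- ifelse(within, nearest, NA_integer_)

# Left join: every df1 row is kept; unmatched rows get NA in the df2 columns
matched <- df2[idx, , drop = FALSE]
names(matched) <- paste0(names(df2), ".y")
joined <- cbind(df1, matched)
rownames(joined) <- NULL
```

The `outer()` call builds a full n-by-m matrix, so this is only reasonable for modest table sizes; for large peak lists, the fuzzyjoin or `data.table::foverlaps()` suggestions in the comments are probably a better starting point.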
code_is_entropy