
When I use dplyr::filter with a sequence and %in% it randomly leaves out rows that it shouldn't. Is there a better way to filter the data so that I reliably get a df that includes every value of q from 0.01 to 1 by steps of 0.01?

Here is a snippet of my data to create df

df <- structure(list(q = c(0.0495185253755619, 0.05, 0.0532000452215362, 
0.0569525370086692, 0.06, 0.0646716714872386, 0.07, 0.0767903072707, 
0.08, 0.0809750285664481, 0.09, 0.0939688126826123, 0.1, 0.103000546236258, 
0.11, 0.117107570056396), r_timestamp = structure(c(1403667900, 
NA, 1403668800, 1403669700, NA, 1403670600, NA, 1403671500, NA, 
1403672400, NA, 1403673300, NA, 1403674200, NA, 1403675100), class = c("POSIXct", 
"POSIXt"), tzone = "Etc/GMT-4"), NO3_rise = c(0.0482379790550339, 
NA, 0.0482408804822149, 0.0496608873041167, NA, 0.0510808941260188, 
NA, 0.053096735586062, NA, 0.0551125770461051, NA, 0.0559331273472383, 
NA, 0.0567536776483717, NA, 0.0531344453067981)), row.names = c(NA, 
-16L), class = "data.frame")

Here is the code. The resulting df2 should have 7 rows, one for each q value from 0.05 to 0.11 in steps of 0.01. Instead, the code currently returns df2 with only 4 rows, with q values of 0.05, 0.08, 0.09, and 0.11.

# Packages
 library("tidyverse")
 library("lubridate")
 library("zoo")

# Code chunk
  df2 <- df %>% 
    # Interpolate missing solute values
    mutate_at(vars(c(NO3_rise)),
              funs(na.approx(., x = q, xout = q, na.rm = FALSE))) %>% 
    # Only keep rows where q value matches sequence below
    filter(q %in% seq(0.01, 1, by = 0.01))
D Kincaid
  • It's impossible to help without being able to reproduce your problem. Instead of all that code, can you use `dput` to share a minimal example of your dataframe that reproduces this issue? My best guess would be that `q` is a floating point number, so you have values that appear to be, for example, `0.02`, but are actually (due to floating point rounding errors in however they were generated) something like `0.0200000000001` – divibisan May 02 '19 at 16:32
  • To add to @divibisan's comment: give us a sample of what goes into the `group_by` at the very end. We probably only need the last 5 lines to debug, not a whole 63 lines. You mark checking that `q` matches a step above—you might be better off filtering against a lag or something – camille May 02 '19 at 16:37
  • I've posted an answer explaining this problem in detail, but really it could be considered a duplicate of: [Why are these numbers not equal?](https://stackoverflow.com/questions/9508518/why-are-these-numbers-not-equal) – divibisan May 02 '19 at 19:24

1 Answer


This is an issue caused by floating point precision problems. Let's look at the 5th row:

df$q[5]
[1] 0.06

df$q[5] == 0.06
[1] TRUE

df$q[5] %in% seq(0.01, 1, by = 0.01)
[1] FALSE

Why? Let's look at its real value. R prints it as 0.06, but the stored double is slightly lower, because 0.06 has no exact binary floating point representation:

sprintf("%.54f",df$q[5])
[1] "0.059999999999999997779553950749686919152736663818359375"

# It's the same as how R represents 0.06
sprintf("%.54f",0.06)
[1] "0.059999999999999997779553950749686919152736663818359375"

# But when made by seq, the number is different!
sprintf("%.54f",seq(0.01, 1, by = 0.01)[6])
[1] "0.060000000000000004718447854656915296800434589385986328"
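A minimal sketch of where the difference likely comes from, assuming that seq with a fractional by builds each element arithmetically as from + k * by (my understanding of seq.default, not something shown in the output above). The variable names are only for illustration:

```r
# Element 6 of the sequence, as returned by seq()
from_seq <- seq(0.01, 1, by = 0.01)[6]

# The same value built by hand, assuming seq() computes from + k * by
by_arithmetic <- 0.01 + 5 * 0.01

# A single division of two exactly representable integers
from_division <- 6 / 100

from_seq == 0.06                    # FALSE: this is why %in% misses the row
from_division == 0.06               # TRUE: matches the literal 0.06
by_arithmetic == from_seq           # compares the hand-built value with seq()
isTRUE(all.equal(from_seq, 0.06))   # TRUE: equal within all.equal's tolerance
```

The two values differ only in the last bit or so, which is far below anything all.equal treats as a real difference but enough to break exact matching with == or %in%.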

So what can you do? The safe option is to use all.equal, which matches with an allowed tolerance designed to work with floating point numbers. Here's a (likely unoptimized) way to use all.equal to compare 2 vectors in the same way as %in%:

fp_all_equal <- function(x, y) {
    # TRUE where an element of x is within all.equal()'s tolerance of any element of y
    vapply(x, function(xi) {
        any(vapply(y, function(yi) isTRUE(all.equal(xi, yi)), logical(1)))
    }, logical(1))
}

fp_all_equal(df$q, seq(0.01, 1, by = 0.01))
 [1] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE

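Another option is to avoid calling seq with a fractional step. As far as I can tell, seq(0.01, 1, by = 0.01) builds each element as from + k * by, so the rounding error in 0.01 is multiplied and added, and the result can land on a different double than the one R uses for the literal 0.06. If you instead make an integer sequence with seq and then divide it by 100 with /, each element comes from a single division and %in% will work here (I don't promise that this will always work):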

df$q %in% (seq(1,100)/100)
 [1] FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE  TRUE FALSE
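To tie this back to the original filter call, here is a hedged sketch of a tolerance-based match inside dplyr::filter. The helper near_any, the tolerance of 1e-8, and the small df_demo data frame are assumptions made for illustration; df_demo just reuses a few q values from the question.

```r
library(dplyr)

# Hypothetical helper: TRUE where x is within `tol` of any target value.
# isTRUE() turns NA comparisons into FALSE, so rows with missing q are dropped.
near_any <- function(x, targets, tol = 1e-8) {
  vapply(x, function(xi) isTRUE(any(abs(xi - targets) < tol)), logical(1))
}

# Small stand-in for df, using a few q values from the question
df_demo <- data.frame(
  q = c(0.0495185253755619, 0.05, 0.0532000452215362, 0.06, 0.07)
)

# Keep only rows whose q is (numerically) a multiple of 0.01
df_demo2 <- df_demo %>%
  filter(near_any(q, seq(0.01, 1, by = 0.01)))

df_demo2$q
```

Because the comparison uses an absolute tolerance instead of exact equality, it does not matter whether the targets come from seq(0.01, 1, by = 0.01) or seq(1, 100) / 100. The tolerance should be chosen well below the spacing of the values you want to tell apart (0.01 here).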
divibisan