Hi, I get different results from dplyr functions when I use standard evaluation through the lazyeval package.
Here is how to reproduce something close to my real data, which has 250k rows and about 230k groups. I would like to group by id1 and id2 and keep, for each group, the rows where datetime equals max(datetime).
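To make the goal concrete, here is a minimal base R sketch on a made-up toy data frame. The `toy` object and its values are invented for illustration and are not part of my real data. It keeps, for each (id1, id2) pair, the row with the latest datetime:

```r
# Toy data, invented for illustration only
toy <- data.frame(
  id1 = c("a", "a", "b", "b"),
  id2 = c("x", "x", "y", "z"),
  datetime = as.POSIXct(c("2015-01-01", "2015-02-01",
                          "2015-03-01", "2015-04-01"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# One key per (id1, id2) pair
key <- paste(toy$id1, toy$id2)

# Per-group maximum of datetime (in seconds), aligned with the rows of toy
secs <- as.numeric(toy$datetime)
group_max <- ave(secs, key, FUN = max)

# Keep only the rows that hold their group's maximum
latest <- toy[secs == group_max, ]
nrow(latest)
```

The toy data contains three distinct pairs, so I would expect one row per pair here.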
library(dplyr)
# random datetime generation function by Dirk Eddelbuettel
# http://stackoverflow.com/questions/14720983/efficiently-generate-a-random-sample-of-times-and-dates-between-two-dates
rand.datetime <- function(N, st = "2012/01/01", et = "2015/08/13") {
  st <- as.POSIXct(as.Date(st))
  et <- as.POSIXct(as.Date(et))
  # Length of the interval in seconds
  dt <- as.numeric(difftime(et, st, units = "secs"))
  # N sorted random offsets within the interval
  ev <- sort(runif(N, 0, dt))
  st + ev
}
set.seed(42)
# Create 230,000 (id1, id2) pairs
ids <- data_frame(id1 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"),
                  id2 = stringi::stri_rand_strings(23e4, 9, pattern = "[0-9]"))
# Randomly resample rows from ids[1:2000, ] so that some pairs appear more than once
ids <- rbind(ids, ids[sample(1:2000, 20000, replace = TRUE), ])
datas <- mutate(ids, datetime = rand.datetime(25e4))
With the NSE version I get 230000 rows:
df1 <-
  datas %>%
  group_by(id1, id2) %>%
  filter(datetime == max(datetime))
nrow(df1) # 230000
But with the SE version I get only 229977 rows:
ids <- c("id1", "id2")
filterVar <- "datetime"
filterFun <- "max"
df2 <-
  datas %>%
  group_by_(ids) %>%
  filter_(.dots = lazyeval::interp(~var == fun(var),
                                   var = as.name(filterVar),
                                   fun = as.name(filterFun)))
nrow(df2) # 229977
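One check I can think of is whether both pipelines really group by the same columns. The sketch below, in base R on a made-up toy data frame, shows how the row count would change if the grouping used only id1 instead of the (id1, id2) pair. The `toy` data and the `keep_latest` helper are hypothetical names introduced for illustration, and the idea that the grouping differs is only an assumption I have not verified:

```r
# Toy data, invented for illustration: id1 "a" is shared by two different pairs
toy <- data.frame(
  id1 = c("a", "a", "b"),
  id2 = c("x", "y", "z"),
  datetime = as.POSIXct(c("2015-01-01", "2015-02-01", "2015-03-01"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: keep the rows holding the max datetime within each group key
keep_latest <- function(df, key) {
  secs <- as.numeric(df$datetime)
  df[secs == ave(secs, key, FUN = max), ]
}

by_pair <- keep_latest(toy, paste(toy$id1, toy$id2))  # grouped by (id1, id2)
by_id1  <- keep_latest(toy, toy$id1)                  # grouped by id1 only

nrow(by_pair)
nrow(by_id1)
```

On the real data, comparing `length(unique(datas$id1))` with `nrow(unique(datas[c("id1", "id2")]))` might show whether some id1 values are shared between different pairs, which would make the two groupings differ.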
My two pieces of code should be equivalent, right? Why do I get different results? Thanks.