
I often have data sets with multiple event measures over time, and I want to take the maximum date for each event within a month. I do this by creating year and month variables, arranging by date in descending order, grouping by all variables except date, and then using slice to get the max date. I heard Hadley say in a video that arrange is a slow operation, so I was wondering what the efficient way to do this is within the tidyverse.

Please post base, data.table, and other answers as well so that other users may benefit from this question, but my main interest here is a tidyverse approach.

How I do this operation currently:

library(tidyverse)

set.seed(10)
dat <- data_frame(
    date = sample(seq(as.Date('1999/01/01'), as.Date('2001/01/01'), by="day"), 1000, TRUE),
    cash = sample(1010:1030, 1000, TRUE),
    stage = sample(LETTERS[1:7], 1000, TRUE)
) %>% distinct()


dat %>%
    mutate(
        year = format(date, '%Y'),
        month = format(date, '%B')
    ) %>%
    arrange(desc(date)) %>%
    group_by(cash, stage, year, month) %>%
    slice(1)
Tyler Rinker
  • `group_by(cash, stage, year, month) %>% slice(which.max(date))`? – Axeman Mar 02 '17 at 20:06
  • Welp, guess not, yours is faster. – Axeman Mar 02 '17 at 20:09
  • There are some options/benchmarks [here](http://stackoverflow.com/questions/24070714/extract-row-corresponding-to-minimum-value-of-a-variable-by-group). – David Arenburg Mar 02 '17 at 20:10
  • You can also use `yearmonth = format(date, '%Y-%m')` instead of two variables. – talat Mar 02 '17 at 20:15
  • Also [this](http://stackoverflow.com/questions/24558328/how-to-select-the-row-with-the-maximum-value-in-each-group). – David Arenburg Mar 02 '17 at 20:16
  • Assuming this doesn't get closed as a dupe, you probably want to write your example so that it's extensible to more rows (like assigning some `n` that can be used to scale it up). – Frank Mar 02 '17 at 20:23
  • I imagine that the OP's approach with arrange performs better than the `slice(which.max(date))` approach, since the ordering is done only once for the whole data and only the simple `slice(1)` is a grouped operation, whereas in the other case `slice(which.max(date))` is done by group. Perhaps it's also relevant that in previous dplyr versions, `arrange` used to order by grouping variables as well, afaik. – talat Mar 02 '17 at 20:32
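
For reference, the `slice(which.max(date))` and single-yearmonth variants suggested in the comments combine into something like this (a sketch, for comparison with the approach above):

dat %>%
    mutate(yearmon = format(date, '%Y-%m')) %>%
    group_by(cash, stage, yearmon) %>%
    slice(which.max(date))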

2 Answers


The OP doesn't include a way to scale up the benchmark, so I'm making my own:

library(data.table)
library(dplyr)    

n = 3e6
n_days = 20000

set.seed(10)
dat <- data_frame(
    date = sample(
      seq(as.Date('1999/01/01'), as.Date('1999/01/01') + n_days - 1, by="day")
      , n, TRUE),
    cash = sample(1010:1030, n, TRUE),
    stage = sample(LETTERS[1:7], n, TRUE)
) %>% distinct()

DT = data.table(dat)[, date := as.IDate(date)]

Testing:

# OP's approach
system.time(
  res <- dat %>%
    mutate(
        year = format(date, '%Y'),
        month = format(date, '%B')
    ) %>%
    arrange(desc(date)) %>%
    group_by(cash, stage, year, month) %>%
    slice(1)
)
#    user  system elapsed 
#    9.44    0.09    9.54 

# a data.table way
system.time({
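  # g is the first day of the month: shift each date back by (day-of-month - 1) days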
  DTres <- DT[, g := date - mday(date) + 1L ][order(-date), .SD[1L], by=.(cash, stage, g)]
})
#    user  system elapsed 
#    0.51    0.00    0.52 

# verify
fsetequal(
  data.table(res[, c("cash","stage","date")])[, date := as.IDate(date)][], 
  DTres[, c("cash","stage","date")]
) # TRUE

Translating back to dplyr:

system.time({
  newres <- dat %>% mutate(g = date - as.POSIXlt(date)$mday + 1) %>% 
    arrange(desc(date)) %>% group_by(cash, stage, g) %>% slice(1L)
})
# Error, who knows why?

system.time({
  newres <- dat %>% mutate(g = date + 1 - date %>% as.POSIXlt %>% `[[`("mday")) %>% 
    arrange(desc(date)) %>% group_by(cash, stage, g) %>% slice(1L)
})
#    user  system elapsed 
#    1.47    0.04    1.52

fsetequal(
  data.table(res[c("date","cash","stage")]),   
  data.table(newres[c("date","cash","stage")])
) # TRUE
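
If lubridate is available (it is installed with the tidyverse, though not always attached by library(tidyverse)), the month-rounding step can also be written with floor_date(). This is only a sketch; I haven't benchmarked this variant:

library(lubridate)

system.time({
  lubres <- dat %>%
    mutate(g = floor_date(date, "month")) %>%  # first of the month, via lubridate
    arrange(desc(date)) %>%
    group_by(cash, stage, g) %>%
    slice(1L)
})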

# or ...
iddat <- dat %>% mutate(date = data.table::as.IDate(date))
mday <- data.table::mday
system.time({
  borrowres <- iddat %>% arrange(desc(date)) %>% 
    distinct(cash, stage, g = date - mday(date) + 1L, .keep_all = TRUE)  # .keep_all retains date for the check below
})
#    user  system elapsed 
#    0.92    0.02    0.94 

fsetequal(
  data.table(borrowres[names(DTres)]),   
  DTres
) # TRUE

The relative timings didn't change much as I tweaked n and n_days. Thanks to @Arun for this way of rounding dates down to the first of the month; previously, I had round(date, "months"). It seems that the key is using date arithmetic instead of format. I'm not sure what explains the remaining difference in the timings; maybe it could be addressed by using dtplyr. The switch to arrange %>% distinct didn't do much beyond cleaning up the syntax.
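
For what it's worth, a dtplyr version might look like the following; this is only a sketch, assuming dtplyr's lazy_dt() interface, and I haven't benchmarked it:

library(dtplyr)

system.time({
  dtpres <- dat %>%
    lazy_dt() %>%                                  # build the data.table translation lazily
    mutate(g = date - data.table::mday(date) + 1L) %>%
    group_by(cash, stage, g) %>%
    filter(date == max(date)) %>%                  # rows are distinct, so each group's max date is unique
    ungroup() %>%
    as_tibble()                                    # force the computation
})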

Side note: I'm loading dplyr instead of tidyverse because I don't really know what the latter contains. I tried it with tidyverse and got the same timings, though.

Frank
  • @Arun I guess you mean `mday`? Yeah, it's 5x faster with `system.time(DTres2 <- DT[, g := date - as.integer(mday(date)) + 1L ][order(-date), .SD[1L], by=.(cash, stage, g)])` Thanks. I'll edit it later. Feel free to edit; I'm not entirely sure my "why it's faster" is right. (I don't know if as.integer is needed there.. will check later.) – Frank Mar 03 '17 at 22:56

A couple of other data.table options:

f.dt <- function(dat) {
    DT <- data.table(dat)[,yearmon := format(date, "%Y %B")]
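    # number rows within each group in descending-date order; keep the first (latest) row and drop the helper idx column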
    DT[order(-date),idx := 1:.N, 
       by = .(cash, stage, yearmon)
    ][idx == 1, !"idx"][]
}

f2.dt <- function(dat) {
    DT <- data.table(dat)[,yearmon := format(date, "%Y %B")]
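    # .I gives global row numbers; pick the row with the max date in each group, then subset once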
    DT[DT[, .I[which.max(date)], 
        by = .(cash, stage, yearmon)]$V1,][]
}

The second one uses Señor O's second approach in the linked question.

Testing these against

f.dplyr <- function(dat) {
    dat %>%
        mutate(
            yearmon = format(date, '%Y %B')
        ) %>%
        arrange(desc(date)) %>%
        group_by(cash, stage, yearmon) %>%
        slice(1)    
}

with Frank's data,

fsetequal(f.dt(dat), data.table(f.dplyr(dat)))
# [1] TRUE

fsetequal(f2.dt(dat), data.table(f.dplyr(dat)))
# [1] TRUE

microbenchmark::microbenchmark(
    f.dplyr(dat),
    f.dt(dat),
    f2.dt(dat),
    times = 10L
)
# Unit: seconds
#          expr      min       lq     mean   median       uq      max neval
#  f.dplyr(dat) 3.446304 3.562061 3.601803 3.598340 3.625105 3.860911    10
#     f.dt(dat) 1.525025 1.540881 1.727772 1.561149 1.718817 2.422788    10
#    f2.dt(dat) 1.299834 1.315242 1.510534 1.384346 1.667197 2.262938    10

Data

n = 3e6
n_days = 20000

set.seed(10)
dat <- dplyr::data_frame(
    date = sample(
      seq(as.Date('1999/01/01'), 
          as.Date('1999/01/01') + n_days - 1, 
          by = "day"), n, TRUE),
    cash = sample(1010:1030, n, TRUE),
    stage = sample(LETTERS[1:7], n, TRUE)
) %>% dplyr::distinct()
nrussell