3

I am looking for the best way to implement the creation of a new variable,numWithin365, defined as follows:

Given a column of dates, dates, count the number of other dates in the column within the preceding 365 days. This problem could be generalized beyond a vector of dates.

Here is one implementation; I am looking for any advice that could help it scale better.

library(dplyr)

# set seed for reproducibility
set.seed(42)

# function to calculate number of dates in prior year
within365 <- function(col){
  sapply(col, function(x){
    sum(x-365 < col & col <= x-1)
    }
  )
}
# fake data sorted chronologically
df <- data.frame(dates = sample(seq(as.Date('2015/01/01'), as.Date('2020/12/31'), 
                by="day"), 10)) %>% arrange(dates)

# applying the function
df %>% mutate(numWithin365 = within365(dates))
        dates numWithin365
1  2015-12-22            0
2  2016-09-25            1
3  2018-01-02            0
4  2018-02-25            1
5  2018-03-22            2
6  2018-06-05            3
7  2018-08-19            4
8  2019-06-13            1
9  2020-09-02            0
10 2020-09-27            1
Jon S
  • 165
  • 6
  • This was unintentional on my part, but I think one could be able to leverage the sorting and only check dates “before”. – Jon S Feb 02 '21 at 03:01
  • 3
    @Dries You could use a (by now pretty standard) `data.table` non-equi join and aggregate the matches with `by = .EACHI`. `d[ , from := dates - 365]`; `d[d, on = .(dates < dates, dates >= from), .N, by = .EACHI]`. I leave it to your google-fu to find similar posts on SO (there are heaps of them). – Henrik Aug 04 '21 at 13:41
  • @Henrik Why do you need to include `on`? I'm having a hard time finding this in the wiki. – Dewey Brooke Aug 04 '21 at 14:51
  • There is no (official) vignette on `data.table` joins, yet. So please refer to `?data.table` and the `on` argument - with `on` you specify which variables to join on, and for non-equi joins also binary operators, like `<` and `>=`. See also several examples on the same help page. Cheers – Henrik Aug 04 '21 at 14:58
  • @Henrik you should add your solution as an answer. It's nearly as fast as the C++ solution when scaled up. – anjama Aug 05 '21 at 03:33
  • Honestly, your solution is the one I would have gone for. The only change I would make is replace `col <= x - 1` with `col < x` to remove an unnecessary subtraction. Henrik's and Roland's solutions are faster, so if you are actually in a situation where the performance is hurting you (please don't be doing premature optimization...), theirs are worth considering. But your solution is reasonably fast while still being easy to read, not needing a lot of memory, and not needing additional dependencies. Sometimes those things are worth giving up a little performance. – anjama Aug 05 '21 at 03:39
  • Thanks @anjama. To be more memory efficient you can add the count column to the original data by reference: `d[, N := d[d, on = .(dates < dates, dates >= from), .N, by = .EACHI]$N]` – Henrik Aug 05 '21 at 05:50

3 Answers3

7

If a dependency on Rcpp is not an issue, I like it for such tasks as it is easy to maintain.

library(Rcpp)
cppFunction('
  NumericVector count365(const NumericVector x) {
    // assumes that x is sorted
    
    size_t n = x.size(); 
    
    //initialize vector of zeros for counts
    NumericVector N = NumericVector(n);
    
    double lim;
    
    // start loop from second element of x
    for (size_t i = 1; i < n; ++i) {
      lim = x[i] - 365;
      
      //loop backwards from preceding element
      for (size_t j = i-1; j >= 0; --j) {
      
        //check if within 365 day range
        if (x[j] >= lim) {
          N[i]++;
        } else {
          break;
        }
      }
    }
    
    return N;
  }
')

df$numWithin365 <- count365(df$dates)
#        dates numWithin365
#1  2015-12-22            0
#2  2016-09-25            1
#3  2018-01-02            0
#4  2018-02-25            1
#5  2018-03-22            2
#6  2018-06-05            3
#7  2018-08-19            4
#8  2019-06-13            1
#9  2020-09-02            0
#10 2020-09-27            1
Roland
  • 127,288
  • 10
  • 191
  • 288
  • Thank you very much. I'm absolutely not familiar with Rccp, but this is easy to read and understand and seems (based on the benchmark in another anwer) very memory efficient. To expand this to a grouped situation (several dates per ID, calculations needed within ID), would you go to an lapply-approach, or build the grouping into the Rccp-function? – Dries Aug 09 '21 at 09:22
  • I just use package data.table for goup-by operations. I haven't had a case yet where it was a good use of my time to implement the group-by in the function. However, if you have millions of groups it might be worthwhile. – Roland Aug 09 '21 at 09:27
4

Here is a base approach that would be fast but would not be as memory efficient (or fast) as @Roland.

## assuming dates are ordered
tmp = as.matrix(dist(df$dates, method = 'manhattan')) < 365
colSums(tmp & upper.tri(tmp))

## 1  2  3  4  5  6  7  8  9 10 
## 0  1  0  1  2  3  4  1  0  1 

We are using the dist() function to efficiently calculate the distance between all elements. With this data, we can compare to our criteria (e.g., within 365 days).

Then, we use the fact that the dates were sorted to add the criteria that the dates have to be less than the current date. That is, for column 1, we know this is the first date in the series. Therefore, no other dates will be less than the date represented by column 1. For column 2, we know that column 1 is the only date that is less than. This pattern would continue on until column 10. This pattern is just upper.tri(matrix).

Performance

I was mainly curious at how much faster @Roland's would be. With the caveat that this probably should be a bigger dataset to really show off @Henrik's solution, is the winner by a lot.

bench::mark(
  cole_base = {
    tmp = as.matrix(dist(df$dates, method = 'manhattan')) < 365
    colSums(tmp & upper.tri(tmp))
  }
  , OP = {
    df %>% mutate(numWithin365 = within365(dates))
  }
  , henrik_dt = {
    d[ , from := dates - 365]
    d[d, on = .(dates < dates, dates >= from), .N, by = .EACHI]
  }
  , roland_rcpp = {
    count365(df$dates)
  }
  , ronak_fuzzyjoin = {
    fuzzy_left_join(df1, df1, by = c('dates1' = 'dates', 'dates'), 
                    match_fun = c(`<`, `>`)) %>%
      group_by(dates = dates.x) %>%
      summarise(numWithin365 = sum(!is.na(dates.y)))
  }
  , check = FALSE
)

## # A tibble: 5 x 13
##   expression           min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
##   <bch:expr>      <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
## 1 cole_base        108.3us  125.3us   7338.      6.23KB     4.38  3354     2
## 2 OP                3.64ms   4.36ms    206.      1.59KB     2.10    98     1
## 3 henrik_dt          4.5ms   4.94ms    200.    145.88KB     4.39    91     2
## 4 roland_rcpp        6.1us    6.7us 133645.      2.49KB    13.4   9999     1
## 5 ronak_fuzzyjoin 121.73ms 130.86ms      7.64  154.15KB     7.64     2     2
Cole
  • 11,130
  • 1
  • 9
  • 24
  • 2
    I ran your benchmarks with much larger data sets (presumably OP cares about perf because they actually have a large enough data set that it matters). The issue with your solution is that past 10,000 dates, the memory requirements for the matrix become impractical. With that said, the fuzzyjoin solution was substantially worse with both memory and speed and I had to cut it out of the benchmark early in the process of scaling it up. – anjama Aug 05 '21 at 03:32
  • Thanks @anjama. If you want to edit to include different benchmark or provide how you scaled, I would be happy to include in the answer. Note, it is not surprising that [tag:data.table] and [tag:Rcpp] are better than this - those would have been my approach had we not had those two awesome answers already! – Cole Aug 05 '21 at 11:17
2

We can create a new column subtracting 365 days from dates column and then use fuzzy_left_join to join based on date range.

library(fuzzyjoin)
library(dplyr)

df1 <- df %>% mutate(dates1 = dates - 365)

fuzzy_left_join(df1, df1, by = c('dates1' = 'dates', 'dates'), 
                match_fun = c(`<`, `>`)) %>%
  group_by(dates = dates.x) %>%
  summarise(numWithin365 = sum(!is.na(dates.y)))

#   dates      numWithin365
# * <date>            <int>
# 1 2015-12-22            0
# 2 2016-09-25            1
# 3 2018-01-02            0
# 4 2018-02-25            1
# 5 2018-03-22            2
# 6 2018-06-05            3
# 7 2018-08-19            4
# 8 2019-06-13            1
# 9 2020-09-02            0
#10 2020-09-27            1
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213