Filter data.table between multiple date ranges in R

Question

since a few days I'm working with R and now I have my first problem. I am a complete beginner in R.

I want to check the data of dt between the **start date **and the end date and then write the start date and end date each as a column in dt.

Then I want to order the data in descending order by datetime within the group (ID, start date, end date). For this I would then use the following:

dt[ , valRank:=rank(-datetime), by= list(id, start date, end date)]

My data and search parameters look like this:

library(data.table)

# data.table
dt <- data.table(ID = 1:5
                 , datetime = seq(from = as.Date("01.01.2020", format = "%d.%m.%Y"), by ="month", length.out = 20)
                 , var = rnorm(20))

# Search parameters
start <- c(as.Date("01.01.2020", format = "%d.%m.%Y"), as.Date("01.07.2020", format = "%d.%m.%Y"))
end <- c(as.Date("01.03.2020", format = "%d.%m.%Y"), as.Date("01.05.2021", format = "%d.%m.%Y"))
range <- data.table(start = start, end = end)

Now I have tried several filter versions and wonder why between() does not work but inrange() does?

# Version 0: Why it doesnt work with "between"
filtered0 <- dt[between(x= datetime, lower = range$start, upper = range$end, incbounds=TRUE)]
filtered0
ID   datetime        var
1:  1 2020-01-01 0.50298317
2:  3 2020-03-01 2.06969108
3:  3 2020-08-01 0.62906692
4:  5 2020-10-01 0.75532708
5:  2 2020-12-01 0.59471124
6:  4 2021-02-01 1.54694582
7:  1 2021-04-01 0.07542291
# Version 1: 
filtered1 <- setDT(dt)[datetime %inrange% range]
filtered1
# Version 2:
filtered2<- dt[datetime %inrange% range]
# Version 3:
filtered3 <- setDT(dt)[inrange(x = datetime, lower = range$start, upper = range$end, incbounds=TRUE)]

Now I tried for the first time to write the start and end date into the data.table and wonder why the column "start" now contains numbers although the format as.Date(..., , format = "%d.%m.%Y") does not work.

filtered4 <- setDT(dt)[, `:=` (start = ifelse(datetime %inrange% range, range$start[1], range$start[1]) 
                               , end = ifelse(datetime %inrange% range, range$end[1], range$end[1] ) )]

   ID   datetime         var start   end
 1:  1 2020-01-01  0.50298317 18262 18322
 2:  2 2020-02-01 -0.26148738 18262 18322
 3:  3 2020-03-01  2.06969108 18262 18322
 4:  4 2020-04-01  1.99797128 18262 18322
 5:  5 2020-05-01 -0.93218939 18262 18322
 6:  1 2020-06-01  0.35842662 18262 18322
 7:  2 2020-07-01 -1.07181184 18262 18322
 8:  3 2020-08-01  0.62906692 18262 18322
 9:  4 2020-09-01  1.34515866 18262 18322
10:  5 2020-10-01  0.75532708 18262 18322
11:  1 2020-11-01  0.89861745 18262 18322
12:  2 2020-12-01  0.59471124 18262 18322
13:  3 2021-01-01 -0.68267021 18262 18322
14:  4 2021-02-01  1.54694582 18262 18322
15:  5 2021-03-01 -0.05550008 18262 18322
16:  1 2021-04-01  0.07542291 18262 18322
17:  2 2021-05-01  0.34271740 18262 18322
18:  3 2021-06-01  1.04142837 18262 18322
19:  4 2021-07-01  0.75517388 18262 18322
20:  5 2021-08-01  0.50562581 18262 18322

The above code is of course not correct, that datetime is between start and end date, because I specify the first element of the search parameters hard. This gave me the expected result, but only for the first daterange. I wonder if this doesn't work with inrange() and I should use dplyr::between with rowwise() instead? But how can I do this, because the following does not work either.

filtered5 <- setDT(dt)[, `:=` (start = range[, start], ende = range[, end])]

The last step is to arrange them. For this I use the following. Is this the right wayo or should I use RANK(), DENSE_RANK(), ROW_NUMBER(). If all three functions are similar to SQL?

dt[ , valRank:=rank(-datetime), by= list(id, start date, end date)]

Any help would be very welcome. Sorry if it is written a bit jumbled, but these are my first steps in R and this is also my first question in this forum.

I wish you all a nice evening. Best regards

I'm a little confused by this code: `ifelse(datetime %inrange% range, range$start[1], range$start[1])`. This says for instance, *"if it is in range then start1 else start1"*, in which case drop `ifelse` and just assign `start = range$start[1]`. FYI, `base::ifelse` is not [class safe](https://stackoverflow.com/q/6668963/3358272), it should not be used with any class that is not of `logical`, `integer`, `numeric`, or `character` (notably absent from that list are `Date` and `POSIXt`). — r2evans, Apr 26 '23 at 22:03
btw, until `dplyr_1.1.0`, the implementation of `dplyr::between` was limited in that the `left=` and `right=` arguments must be length-1, while `data.table::between` accepted vectors. Neither are appropriate for this, since even in `dplyr::between` in the newer packages, they both require that the vectors all have the same lengths and be applied element-wise, which is not what you want. — r2evans, Apr 26 '23 at 22:46

DocBuckets · Answer 1 · 2023-04-26T22:27:25.557

Welcome. In general, it is useful to show us explicitly what your desired output is. Also, showing your inputs in expanded form is useful so highly skilled users don't have to run the code but rather simply type the answer here.

Here is a full working example of what I think you want:

Only required package:

library(dplyr)

Input dataset dt:

dt <-
  tibble(
    ID = rep(1:5, 4),
    datetime = seq(
      from = as.Date("01.01.2020", format = "%d.%m.%Y"),
      by = "month",
      length.out = 20
    )  ,
    var = rnorm(20)
  )

ranges dataset (bad idea to use "range" as a var name, as it is also a function)

ranges <-
  tibble(start = c(
    as.Date("01.01.2020", format = "%d.%m.%Y"),
    as.Date("01.07.2020", format = "%d.%m.%Y")
  ),
  end = c(
    as.Date("01.03.2020", format = "%d.%m.%Y"),
    as.Date("01.05.2021", format = "%d.%m.%Y")
  ))

The solution:

"Take dt, cross it exhastively with all rows of ranges, then filter to include only those rows where datetime is between start and end.

dt %>% 
  crossing(ranges) %>% 
  filter(datetime > start & datetime < end)

Result:

ID  datetime    var     start       end
1   2020-11-01  1.221   2020-07-01  2021-05-01
1   2021-04-01  0.021   2020-07-01  2021-05-01
2   2020-02-01  -0.18   2020-01-01  2020-03-01
2   2020-12-01  1.585   2020-07-01  2021-05-01
3   2020-08-01  -0.55   2020-07-01  2021-05-01
3   2021-01-01  0.554   2020-07-01  2021-05-01
4   2020-09-01  -0.66   2020-07-01  2021-05-01
4   2021-02-01  1.504   2020-07-01  2021-05-01
5   2020-10-01  0.659   2020-07-01  2021-05-01
5   2021-03-01  0.202   2020-07-01  2021-05-01

You can use arrange() as another piped line if you want to sort it another way e.g.:

dt %>% 
  crossing(range) %>% 
  filter(datetime > start & datetime < end) %>% 
  arrange(datetime, ID, start)

"Only required package:" and all [the imports](https://cran.r-project.org/web/packages/dplyr/index.html) that come with it. — SymbolixAU, Apr 26 '23 at 22:37
@SymbolixAU, _far_ better than the more popular use these days of `library(tidyverse)`. — r2evans, Apr 26 '23 at 22:48
@SymbolixAU True that it's a heavy hitter package, but the computer does the hard work for you in working out all the imports. You just have to remember and understand one thing: dplyr. Plus, OP is new to R. dplyr syntax is intuitive and helps one get used to thinking in the context of data management. I think this is the most beginner-friendly way to get the result that OP wanted. — DocBuckets, Apr 27 '23 at 00:30

r2evans · Answer 2 · 2023-04-26T22:39:09.897

This seems like a non-equi join.

library(data.table)
range[dt, on = .(start <= datetime, end >= datetime)]
#          start        end    ID          var
#         <Date>     <Date> <int>        <num>
#  1: 2020-01-01 2020-01-01     1 -0.535432498
#  2: 2020-02-01 2020-02-01     2 -0.713242072
#  3: 2020-03-01 2020-03-01     3 -0.246650715
#  4: 2020-04-01 2020-04-01     4  0.598074968
#  5: 2020-05-01 2020-05-01     5  1.321493645
#  6: 2020-06-01 2020-06-01     1  0.431644671
#  7: 2020-07-01 2020-07-01     2 -0.035380524
#  8: 2020-08-01 2020-08-01     3 -0.764943081
#  9: 2020-09-01 2020-09-01     4  1.123448021
# 10: 2020-10-01 2020-10-01     5  0.059585556
# 11: 2020-11-01 2020-11-01     1  0.987138106
# 12: 2020-12-01 2020-12-01     2 -0.229781596
# 13: 2021-01-01 2021-01-01     3 -1.446177021
# 14: 2021-02-01 2021-02-01     4 -0.364705735
# 15: 2021-03-01 2021-03-01     5  0.007079030
# 16: 2021-04-01 2021-04-01     1 -1.967512640
# 17: 2021-05-01 2021-05-01     2 -0.009452937
# 18: 2021-06-01 2021-06-01     3 -1.960379639
# 19: 2021-07-01 2021-07-01     4  0.704577974
# 20: 2021-08-01 2021-08-01     5  0.805717998
#          start        end    ID          var

The way that data.table preserves columns can be confusing or frustrating, noting that what appears to be range$start and end are more reminiscent of copies of the original dt$datetime. When doing things like this, for clarity I often copy columns into throw-away columns for the join.

out <- range[,c("s", "e") := .(start, end)
  ][dt, on = .(s <= datetime, e >= datetime)
  ][, valrank := rank(-xtfrm(s)), by = .(ID, start, end)
  ][, e := NULL ] |>
  setnames("s", "datetime")
out
#          start        end   datetime    ID          var valrank
#         <Date>     <Date>     <Date> <int>        <num>   <num>
#  1: 2020-01-01 2020-03-01 2020-01-01     1 -0.535432498       1
#  2: 2020-01-01 2020-03-01 2020-02-01     2 -0.713242072       1
#  3: 2020-01-01 2020-03-01 2020-03-01     3 -0.246650715       1
#  4:       <NA>       <NA> 2020-04-01     4  0.598074968       2
#  5:       <NA>       <NA> 2020-05-01     5  1.321493645       2
#  6:       <NA>       <NA> 2020-06-01     1  0.431644671       1
#  7: 2020-07-01 2021-05-01 2020-07-01     2 -0.035380524       3
#  8: 2020-07-01 2021-05-01 2020-08-01     3 -0.764943081       2
#  9: 2020-07-01 2021-05-01 2020-09-01     4  1.123448021       2
# 10: 2020-07-01 2021-05-01 2020-10-01     5  0.059585556       2
# 11: 2020-07-01 2021-05-01 2020-11-01     1  0.987138106       2
# 12: 2020-07-01 2021-05-01 2020-12-01     2 -0.229781596       2
# 13: 2020-07-01 2021-05-01 2021-01-01     3 -1.446177021       1
# 14: 2020-07-01 2021-05-01 2021-02-01     4 -0.364705735       1
# 15: 2020-07-01 2021-05-01 2021-03-01     5  0.007079030       1
# 16: 2020-07-01 2021-05-01 2021-04-01     1 -1.967512640       1
# 17: 2020-07-01 2021-05-01 2021-05-01     2 -0.009452937       1
# 18:       <NA>       <NA> 2021-06-01     3 -1.960379639       1
# 19:       <NA>       <NA> 2021-07-01     4  0.704577974       1
# 20:       <NA>       <NA> 2021-08-01     5  0.805717998       1
#          start        end   datetime    ID          var valrank

This is 5 lines and almost 200 characters, while the dplyr solution is 3 lines and 76 characters (and is a one-line-per-action pipe format). I think the dplyr solution is far more straightforward. — DocBuckets, Apr 26 '23 at 22:26
The OP tagged [tag:data.table] and demonstrated a problem that suggests a non-equi join. Don't judge `data.table` semantics, in several areas they can be far superior in performance to `dplyr` pipes, with a _subjective_ risk of not looking as one-line-per-action pretty. (BTW, I believe your output to be incorrect, "straightforward code" or otherwise.) — r2evans, Apr 26 '23 at 22:55

Filter data.table between multiple date ranges in R

2 Answers2