since a few days I'm working with R and now I have my first problem. I am a complete beginner in R.
I want to check the data of dt between the **start date **and the end date and then write the start date and end date each as a column in dt.
Then I want to order the data in descending order by datetime within the group (ID, start date, end date). For this I would then use the following:
dt[ , valRank:=rank(-datetime), by= list(id, start date, end date)]
My data and search parameters look like this:
library(data.table)
# data.table
dt <- data.table(ID = 1:5
, datetime = seq(from = as.Date("01.01.2020", format = "%d.%m.%Y"), by ="month", length.out = 20)
, var = rnorm(20))
# Search parameters
start <- c(as.Date("01.01.2020", format = "%d.%m.%Y"), as.Date("01.07.2020", format = "%d.%m.%Y"))
end <- c(as.Date("01.03.2020", format = "%d.%m.%Y"), as.Date("01.05.2021", format = "%d.%m.%Y"))
range <- data.table(start = start, end = end)
Now I have tried several filter versions and wonder why between() does not work but inrange() does?
# Version 0: Why it doesnt work with "between"
filtered0 <- dt[between(x= datetime, lower = range$start, upper = range$end, incbounds=TRUE)]
filtered0
ID datetime var
1: 1 2020-01-01 0.50298317
2: 3 2020-03-01 2.06969108
3: 3 2020-08-01 0.62906692
4: 5 2020-10-01 0.75532708
5: 2 2020-12-01 0.59471124
6: 4 2021-02-01 1.54694582
7: 1 2021-04-01 0.07542291
# Version 1:
filtered1 <- setDT(dt)[datetime %inrange% range]
filtered1
# Version 2:
filtered2<- dt[datetime %inrange% range]
# Version 3:
filtered3 <- setDT(dt)[inrange(x = datetime, lower = range$start, upper = range$end, incbounds=TRUE)]
Now I tried for the first time to write the start and end date into the data.table and wonder why the column "start" now contains numbers although the format as.Date(..., , format = "%d.%m.%Y") does not work.
filtered4 <- setDT(dt)[, `:=` (start = ifelse(datetime %inrange% range, range$start[1], range$start[1])
, end = ifelse(datetime %inrange% range, range$end[1], range$end[1] ) )]
ID datetime var start end
1: 1 2020-01-01 0.50298317 18262 18322
2: 2 2020-02-01 -0.26148738 18262 18322
3: 3 2020-03-01 2.06969108 18262 18322
4: 4 2020-04-01 1.99797128 18262 18322
5: 5 2020-05-01 -0.93218939 18262 18322
6: 1 2020-06-01 0.35842662 18262 18322
7: 2 2020-07-01 -1.07181184 18262 18322
8: 3 2020-08-01 0.62906692 18262 18322
9: 4 2020-09-01 1.34515866 18262 18322
10: 5 2020-10-01 0.75532708 18262 18322
11: 1 2020-11-01 0.89861745 18262 18322
12: 2 2020-12-01 0.59471124 18262 18322
13: 3 2021-01-01 -0.68267021 18262 18322
14: 4 2021-02-01 1.54694582 18262 18322
15: 5 2021-03-01 -0.05550008 18262 18322
16: 1 2021-04-01 0.07542291 18262 18322
17: 2 2021-05-01 0.34271740 18262 18322
18: 3 2021-06-01 1.04142837 18262 18322
19: 4 2021-07-01 0.75517388 18262 18322
20: 5 2021-08-01 0.50562581 18262 18322
The above code is of course not correct, that datetime is between start and end date, because I specify the first element of the search parameters hard. This gave me the expected result, but only for the first daterange. I wonder if this doesn't work with inrange() and I should use dplyr::between with rowwise() instead? But how can I do this, because the following does not work either.
filtered5 <- setDT(dt)[, `:=` (start = range[, start], ende = range[, end])]
The last step is to arrange them. For this I use the following. Is this the right wayo or should I use RANK(), DENSE_RANK(), ROW_NUMBER(). If all three functions are similar to SQL?
dt[ , valRank:=rank(-datetime), by= list(id, start date, end date)]
Any help would be very welcome. Sorry if it is written a bit jumbled, but these are my first steps in R and this is also my first question in this forum.
I wish you all a nice evening. Best regards