I have the following dataset:
Data:
I would like to sum, by group, the values of all events that started within 60 days (already calculated interval = Start_interval, End_interval) without adding the same row in more than one interval.
Expected output:
I did research and found some solutions that allowed me to obtain, until then, the results displayed below, very close to what I expected. For example: In R: how to sum a variable by group between two dates
The difference between the questions is that I need the sum to be performed without adding, by group, the same row in more than one period.
Result I got so far:
Does anyone have any suggestions?
Reproducible example:
# Input data
data <- data.table(id = c("Group A", "Group A", "Group A", "Group A",
"Group A", "Group A"),
start_date_event = c("2019-09-15",
"2019-11-24",
"2020-04-19",
"2020-04-25",
"2020-05-25",
"2020-10-27"),
end_date_event = c("2019-09-24",
"2019-11-28",
"2020-04-23",
"2020-04-29",
"2020-05-27",
"2020-11-06"),
start_interval = c("2019-09-15",
"2019-11-24",
"2020-04-19",
"2020-04-25",
"2020-05-25",
"2020-10-27"),
end_interval = c("2019-11-14",
"2020-01-23",
"2020-06-18",
"2020-06-24",
"2020-07-24",
"2020-12-26"),
value = c(9, 4, 4, 4, 2, 15))
# Convert to date
data <- data %>%
dplyr::mutate(start_date_event = as.Date(start_date_event)) %>%
dplyr::mutate(end_date_event = as.Date(end_date_event)) %>%
dplyr::mutate(start_interval = as.Date(start_interval)) %>%
dplyr::mutate(end_interval = as.Date(end_interval))
# Calculating with non-equi join
temp <- data[data,
on = .(start_date_event <= end_interval,
end_date_event >= start_interval)][,
.(value_sum = sum(value)),
by = .(id, start_date_event)]
# Get all
data <- merge(data, temp, all.x = T,
by.x = c("id", "end_interval"),
by.y = c("id", "start_date_event"))
Thanks!