I have a dataframe containing repeated measurements of a number of variables for a large number of samples. I would like to look at each variable / sample combination, compare the measurement dates with the dates in another column, and only keep the row where the minimum date difference exists.
start_date <- as.Date('2021-01-01')
end_date <- as.Date('2021-12-30')
set.seed(1984)
# The dataframe looks something like this
cheese <- data.frame(sample_id = c('1','1','1','1','2','2','2','2','1','1','1','1','2','2','2','2'),
variable = c('a','a','b','b','a','a','b','b',
'a','a','b','b','a','a','b','b'),
value = runif(n = 16, min = 1, max = 10),
measurement_date = c('2021-05-04','2021-08-22','2021-05-04','2021-08-22',
'2021-05-04','2021-08-22','2021-05-04','2021-08-22',
'2021-05-04','2021-08-22','2021-05-04','2021-08-22',
'2021-05-04','2021-08-22','2021-05-04','2021-08-22'),
date2 = as.Date(sample( as.numeric(start_date): as.numeric(end_date), 16,
replace = T),
origin = '1970-01-01'))
And I'd like it to end up like this:
sample_id variable measurement_date date2
1 a 2021-05-04 2021-06-08
1 b 2021-05-04 2021-03-21
2 a 2021-05-04 2021-01-27
2 b 2021-05-04 2021-03-15
1 a 2021-08-22 2021-09-10
1 b 2021-08-22 2021-03-22
2 a 2021-08-22 2021-11-27
2 b 2021-08-22 2021-08-13
I know that the answer falls somewhere in the realm of this similar question, but my example has a different structure that I can't seem to wrap my head around. I've tried using dplyr's filter option, but my solution only returns a single row where the smallest difference across all samples occurs. It doesn't do this for every sample / variable combination.
library(dplyr)
filtered <- cheese %>% filter(abs(difftime(measurement_date,date2)) == min(abs(difftime(measurement_date,date2))))