2

I have a dataframe containing repeated measurements of a number of variables for a large number of samples. I would like to look at each variable / sample combination, compare the measurement dates with the dates in another column, and only keep the row where the minimum date difference exists.

start_date <- as.Date('2021-01-01')  
end_date <- as.Date('2021-12-30') 
set.seed(1984)
# The dataframe looks something like this
cheese <- data.frame(sample_id = c('1','1','1','1','2','2','2','2','1','1','1','1','2','2','2','2'),
                 variable = c('a','a','b','b','a','a','b','b',
                          'a','a','b','b','a','a','b','b'),
             value = runif(n = 16, min = 1, max = 10),
             measurement_date = c('2021-05-04','2021-08-22','2021-05-04','2021-08-22',
                                  '2021-05-04','2021-08-22','2021-05-04','2021-08-22',
                                  '2021-05-04','2021-08-22','2021-05-04','2021-08-22',
                                  '2021-05-04','2021-08-22','2021-05-04','2021-08-22'),
             date2 = as.Date(sample( as.numeric(start_date): as.numeric(end_date), 16, 
                                     replace = T), 
                             origin = '1970-01-01'))

And I'd like it to end up like this:

sample_id   variable   measurement_date    date2
1           a          2021-05-04          2021-06-08
1           b          2021-05-04          2021-03-21
2           a          2021-05-04          2021-01-27
2           b          2021-05-04          2021-03-15
1           a          2021-08-22          2021-09-10
1           b          2021-08-22          2021-03-22
2           a          2021-08-22          2021-11-27
2           b          2021-08-22          2021-08-13

I know that the answer falls somewhere in the realm of this similar question, but my example has a different structure that I can't seem to wrap my head around. I've tried using dplyr's filter option, but my solution only returns a single row where the smallest difference across all samples occurs. It doesn't do this for every sample / variable combination.

library(dplyr)
filtered <- cheese %>% filter(abs(difftime(measurement_date,date2)) == min(abs(difftime(measurement_date,date2))))
Squan Schmaan
  • 323
  • 1
  • 8
  • 1
    You just need to group your variables first, `group_by(sample_id, variable, measurement_date)`. – caldwellst Mar 07 '22 at 06:48
  • The answers to this similar question could get you started: https://stackoverflow.com/questions/63770101/join-with-fuzzy-matching-by-date-in-r –  Mar 07 '22 at 07:37

1 Answers1

2

Based on the comment given by @caldwellst, I didn't group the variables first, which is why I was getting a single value, not one for each grouping:

library(dplyr)
filtered <- cheese %>% 
    group_by(sample_id, variable, measurement_date) %>%
    filter(abs(difftime(measurement_date,date2)) == min(abs(difftime(measurement_date,date2))))
Squan Schmaan
  • 323
  • 1
  • 8