0

I have 2 tables (data1 and data2). My goal is to keep data in data2 that matches with "Site" and "Date" in data1 and also is within +/- 1 hour of data1. I have an example of what I am looking for below:

data1 <- data.frame("Site" = c("ABC", "ABC", "ABC", "EFG", "EFG", "EFG", "EFG"), 
"Date" = c("2022-02-02", "2022-02-02", "2022-02-05", "2011-01-01", "2011-01-01", 
"2012-01-03", "2012-01-05"),"Time" = c("12:00", "13:00", "15:00", "20:00", "20:30", 
"21:59", "23:00"))

data2 <- data.frame("Site" = c("ABC", "ABC", "ABC", "EFG", "EFG", "EFG"), "Date" = 
c("2022-02-02", "2022-02-02", "2022-02-05", "2011-01-01", "2011-01-01", "2012-01-05"),"Time" = c("12:00", "14:30", "16:00", "20:00", "23:00", "22:00"))

data_final <- data.frame("Site" = c("ABC", "ABC", "EFG", "EFG"), "Date" = c("2022-02-02", "2022-02-05", "2011-01-01", "2022-01-05"), "Time" = c("12:00", "16:00", "20:00", 
"22:00"))

data_final is what I would like to end with. data1 has more rows than data2 does. I want to filter data2 to keep data that is within +/- 1 hour matching with date and site. Thank you for your help and let me know if you have any questions :)

r2evans
  • 141,215
  • 6
  • 77
  • 149
Sarah
  • 411
  • 4
  • 14
  • To use the dupe answers, you'll need to either (a) combine `Date` and `Time` into `POSIXct` timestamps, or (b) numerify `Time` in one of many ways. I suggest the `POSIXct` route for several reasons, least of which being that in my experience it seems likely that other processing will be using these columns in a number-like way and intending to find time ranges or differences numerically, and that's typically the best route. – r2evans Jul 06 '22 at 15:08
  • @r2evans thank you for your response, I'm not very well-versed in R, so I'm not understanding your comment. I also took a look at the questions that you said have similar answers and neither help with exactly what Im looking for. Is there anyway you can provide example code that will create the data_final? – Sarah Jul 06 '22 at 15:14
  • First do `data1$DateTime <- as.POSIXct(sprintf("%s %s:00", data1$Date, data1$Time))` and the same for `data2`. Then (using dplyr) `fuzzyjoin::fuzzy_inner_join(data2, data1, by = c("Site", "DateTime"), match_fun = list(`==`, function(a, b) {a >= (b-3600) & a <= (b+3600);})) %>% select(ends_with(".x")) %>% distinct()`. – r2evans Jul 06 '22 at 15:22
  • Or `unique(sqldf::sqldf("select t2.* from data2 t2 inner join data1 t1 on t2.Site=t1.Site and t2.DateTime between (t1.DateTime-3600) and (t1.DateTime+3600)"))` – r2evans Jul 06 '22 at 15:23
  • @r2evans I get an error with the list( == in the code you gave me – Sarah Jul 07 '22 at 01:01
  • darn it, I didn't escape the backticks ... `match_fun=list(\`==\`, function(a,b)...)` – r2evans Jul 07 '22 at 10:56
  • @r2evans thank you, still got another error though Error in which(m) : argument to 'which' is not logical – Sarah Jul 07 '22 at 13:23
  • I don't know what to say. The code I provided (fixed) works on the data you provided. – r2evans Jul 07 '22 at 13:29
  • data1$DateTime <- as.POSIXct(sprintf("%s %s:00", data1$Date, data1$Time)) data2$DateTime <- as.POSIXct(sprintf("%s %s:00", data2$Date, data2$Time)) fuzzyjoin::fuzzy_inner_join(data2, data1, by = c("Site", "DateTime"), match_fun = list('==', function(a, b) {a >= (b-3600) & a <= (b+3600);})) %>% select(ends_with(".x")) %>% distinct() This is what I'm running and I get that error with my data above @r2evans – Sarah Jul 07 '22 at 13:34
  • 1
    @r2evans nevermind, I see I used the wrong `` lol, thank you!!! – Sarah Jul 07 '22 at 13:36

0 Answers0