-2

I have two dataframes that I would like to compare.

instances <- data.frame(id = c("AED","AED","CFR","DRR","DRR","DRR","UN","PO"),
         dates = as.POSIXct(c("2018-05-17 09:52:00","2018-05-17 10:49:00","2018-05-17 10:38:00","2018-05-17 11:29:00","2018-05-17 12:12:00","2018-05-17 13:20:00","2018-05-17 14:28:00","2018-05-17 15:59:00")))

ranges <- data.frame(id = c("AED","CFR","DRR","DRR","UN"),
             start = as.POSIXct(c("2018-05-17 10:00:00","2018-05-17 10:18:00","2018-05-17 11:18:00","2018-05-17 13:10:00","2018-05-17 14:18:00")),
             end = as.POSIXct(c("2018-05-17 11:56:00","2018-05-17 12:23:00","2018-05-17 12:01:00","2018-05-17 14:18:00",NA)))

By id, I want to compare each date in the instances dataframe against its respective date ranges listed in the ranges dataframe. If there is no matching id in the ranges dataframe then it should return as FALSE, and if ranges$end is NA it should also return FALSE. The result should be as follows:

result <- data.frame(id = c("AED","AED","CFR","DRR","DRR","DRR","UN","PO"),
             dates = c("2018-05-17 09:52:00","2018-05-17 10:49:00","2018-05-17 10:38:00","2018-05-17 11:29:00","2018-05-17 12:12:00","2018-05-17 13:20:00","2018-05-17 14:28:00","2018-05-17 15:59:00"),
             inRange = c(FALSE, TRUE, TRUE, TRUE, FALSE, TRUE, TRUE, FALSE),
             outsideRange = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE, FALSE, FALSE))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
Dfeld
  • 187
  • 9
  • I think David's answer here should go close to covering this - https://stackoverflow.com/a/41132376/496803 Something like: `instances[ranges, on=c("id","dates>=start","dates<=end"), hit := 1]` – thelatemail Jun 13 '18 at 05:33
  • 3
    There are two `id`'s with "DRR" in `ranges`. How would you know which one to select when? – Ronak Shah Jun 13 '18 at 05:42
  • Why do you need an inRange and outsideRange variable, and why aren't these for all rows in your desired result each other's opposites? – Lennyy Jun 13 '18 at 06:05

2 Answers2

2
library(dplyr)

instances %>% 
  full_join(ranges) %>% 
  mutate(inRange = case_when(dates >= start & dates <= end ~ T, T ~ F))

    id               dates               start                 end inRange
1  AED 2018-05-17 09:52:00 2018-05-17 10:00:00 2018-05-17 11:56:00 FALSE
2  AED 2018-05-17 10:49:00 2018-05-17 10:00:00 2018-05-17 11:56:00  TRUE
3  CFR 2018-05-17 10:38:00 2018-05-17 10:18:00 2018-05-17 12:23:00  TRUE
4  DRR 2018-05-17 11:29:00 2018-05-17 11:18:00 2018-05-17 12:01:00  TRUE
5  DRR 2018-05-17 11:29:00 2018-05-17 13:10:00 2018-05-17 14:18:00 FALSE
6  DRR 2018-05-17 12:12:00 2018-05-17 11:18:00 2018-05-17 12:01:00 FALSE
7  DRR 2018-05-17 12:12:00 2018-05-17 13:10:00 2018-05-17 14:18:00 FALSE
8  DRR 2018-05-17 13:20:00 2018-05-17 11:18:00 2018-05-17 12:01:00 FALSE
9  DRR 2018-05-17 13:20:00 2018-05-17 13:10:00 2018-05-17 14:18:00  TRUE
10  UN 2018-05-17 14:28:00 2018-05-17 14:18:00                <NA> FALSE
11  PO 2018-05-17 15:59:00                <NA>                <NA> FALSE
Lennyy
  • 5,932
  • 2
  • 10
  • 23
1

data.table solution

I would tackle this problem using the foverlaps() function from data.table... The only problem is that is only accepts complete date-ranges, and in the sample-data provided ranges[,5] has no enddate...

> ranges
   id               start                 end
1 AED 2018-05-17 10:00:00 2018-05-17 11:56:00
2 CFR 2018-05-17 10:18:00 2018-05-17 12:23:00
3 DRR 2018-05-17 11:18:00 2018-05-17 12:01:00
4 DRR 2018-05-17 13:10:00 2018-05-17 14:18:00
5  UN 2018-05-17 14:18:00                <NA>

In order for the following solution to word, all ranges have to have a start AND an end. So, let's fill in the NA using some make up timestamp.

ranges <- data.frame(id = c("AED","CFR","DRR","DRR","UN"),
                     start = as.POSIXct(c("2018-05-17 10:00:00","2018-05-17 10:18:00","2018-05-17 11:18:00","2018-05-17 13:10:00","2018-05-17 14:18:00")),
                     end = as.POSIXct(c("2018-05-17 11:56:00","2018-05-17 12:23:00","2018-05-17 12:01:00","2018-05-17 14:18:00", "2018-05-17 16:18:00")))

> ranges
   id               start                 end
1 AED 2018-05-17 10:00:00 2018-05-17 11:56:00
2 CFR 2018-05-17 10:18:00 2018-05-17 12:23:00
3 DRR 2018-05-17 11:18:00 2018-05-17 12:01:00
4 DRR 2018-05-17 13:10:00 2018-05-17 14:18:00
5  UN 2018-05-17 14:18:00 2018-05-17 16:18:00

Workflow

library(data.table)
#make instances a data.table without key
instances.dt <- setDT( instances, key = NULL )
#create a data.table with the ranges, set keys 
ranges.dt <- setDT( ranges, key = c("id", "start", "end") )

#create a temporary 'range', where start == end, based on the dates-column
instances.dt[, c( "start", "end") := dates]

#create a column 'inRange' using data.table's foverlaps(). 
#use the secons column of the fovelaps' result. If  this column is NA, then no 'hit' was found 
#in ranges.dt and inrange == FALSE, else inRange == TRUE
instances.dt[, inRange := !is.na( foverlaps(instances.dt, ranges.dt, type = "within", mult = "first", nomatch = NA)[,2] )]

#outsideRange is the opposite of inRange
instances.dt[, outsideRange := !inRange]

#remove the temporary columns 'start' and 'end'
instances.dt[, c("start", "end") := NULL]

Result

> instances.dt
    id               dates inRange outsideRange
1: AED 2018-05-17 09:52:00   FALSE         TRUE
2: AED 2018-05-17 10:49:00    TRUE        FALSE
3: CFR 2018-05-17 10:38:00    TRUE        FALSE
4: DRR 2018-05-17 11:29:00    TRUE        FALSE
5: DRR 2018-05-17 12:12:00   FALSE         TRUE
6: DRR 2018-05-17 13:20:00    TRUE        FALSE
7:  UN 2018-05-17 14:28:00    TRUE        FALSE
8:  PO 2018-05-17 15:59:00   FALSE         TRUE

This works blazingly fast, even for huge data.tables.

You can shorten the code, but I always like to do the analysis one step at a time, improving readability.

Chained using magrittr's pipe-operator

library(data.table)
library(magrittr)

ranges.dt <- setDT( ranges, key = c("id", "start", "end") )
result <- setDT( instances, key = NULL ) %>% 
  .[, c( "start", "end") := dates] %>%
  .[, inRange := !is.na( foverlaps( ., ranges.dt, type = "within", mult = "first", nomatch = NA )[,2] )] %>%
  .[, outsideRange := !inRange] %>%
  .[, c("start", "end") := NULL]
Wimpel
  • 26,031
  • 1
  • 20
  • 37