Subset data.frame based on a time interval + or - a list of dates

Question

I have a large (20,000 obs) data.frame containing hourly values, grouped by unique id. I also have a list of dates (each of the dates occurs in the data.frame). I am trying to match the dates to the data.frame and then extract the rows whose datetimes fall within plus or minus a certain time interval of the matching date. For example, in the following data.frame:

 setAs("character", "myDate", function(from) as.POSIXct(from, format = "%m/%e/%Y %H:%M", tz = "UTC"))
# the coercion above parses the date input as UTC when a column is read with colClasses = "myDate"
   df <- read.table(textConnection("datetimeUTC id  value
                             '5/1/2013 5:00'    153 0.53
                            '5/1/2013 6:00'     153 0.46
                            '5/1/2013 7:00'     153 0.53
                            '5/1/2013 8:00'     153 0.46
                            '5/1/2013 9:00'     153 0.44
                            '5/1/2013 10:00'    153 0.48
                            '5/1/2013 11:00'    153 0.49
                            '5/1/2013 12:00'    153 0.49
                            '5/1/2013 13:00'    153 0.51
                            '5/1/2013 14:00'    153 0.53
                            '11/24/2013 9:00'   154 0.45
                            '11/24/2013 10:00'  154 0.46
                            '11/24/2013 11:00'  154 0.49
                            '11/24/2013 12:00'  154 0.55
                            '11/24/2013 13:00'  154 0.61
                            '11/24/2013 14:00'  154 0.7
                            '11/24/2013 15:00'  154 0.74
                            '11/24/2013 16:00'  154 0.78
                            '11/24/2013 17:00'  154 0.77
                            '11/24/2013 18:00'  154 0.79
                            '8/2/2015 1:00'     240 0.2
                            '8/2/2015 2:00'     240 0.2
                            '8/2/2015 3:00'     240 0.2
                            '8/2/2015 4:00'     240 0.22
                            '8/2/2015 5:00'     240 0.22
                            '8/2/2015 6:00'     240 0.27
                            '8/2/2015 7:00'     240 0.23
                            '8/2/2015 8:00'     240 0.21
                            '8/2/2015 9:00'     240 0.22
                            '8/2/2015 10:00'    240 0.22
                            '8/2/2015 11:00'    240 0.21
                            '8/2/2015 12:00'    240 0.21
                            '8/2/2015 13:00'    240 0.21
                            '8/2/2015 14:00'    240 0.22
                            '8/2/2015 15:00'    240 0.24
                            '8/2/2015 16:00'    240 0.25
                            '8/2/2015 17:00'    240 0.12
                            '8/2/2015 18:00'    240 0.32
                            "), header=TRUE, colClasses=c("myDate", "character", "numeric"))

I want to extract, for each id, all observations that fall within 2 hours before or after the matching datetime from this key:

  key <-read.table(textConnection("
     datetimeUTC        id
    '5/1/2013 9:00'     153
    '11/24/2013 14:00'  154
    '8/2/2015 5:00'     240
    '8/2/2015 15:00'        240"), header=TRUE, colClasses=c("myDate",  "character"))

The desired result would look as follows:

  result <- read.table(textConnection("datetimeUTC  id  value
                            '5/1/2013 7:00'     153 0.53
                            '5/1/2013 8:00'     153 0.46
                            '5/1/2013 9:00'     153 0.44
                            '5/1/2013 10:00'    153 0.48
                            '5/1/2013 11:00'    153 0.49
                            '11/24/2013 12:00'  154 0.55
                            '11/24/2013 13:00'  154 0.61
                            '11/24/2013 14:00'  154 0.7
                            '11/24/2013 15:00'  154 0.74
                            '11/24/2013 16:00'  154 0.78
                            '8/2/2015 3:00'     240 0.2
                            '8/2/2015 4:00'     240 0.22
                            '8/2/2015 5:00'     240 0.22
                            '8/2/2015 6:00'     240 0.27
                            '8/2/2015 7:00'     240 0.23
                            '8/2/2015 13:00'    240 0.21
                            '8/2/2015 14:00'    240 0.22
                            '8/2/2015 15:00'    240 0.24
                            '8/2/2015 16:00'    240 0.25
                            '8/2/2015 17:00'    240 0.12
                            "), header=TRUE, colClasses=c("myDate", "character", "numeric"))

This seems like a simple task, but I can't get what I want. A couple of things that I have tried:

result <-df[which(df$id == key$id &(df$datetimeUTC >= key$datetimeUTC -2*60*60 |df$datetimeUTC <= key$datetimeUTC + 2*60*60 )),]

 library(data.table)
  dt <- setDT(df)
  dt[dt$datetimeUTC %between% c(dt$datetimeUTC - 2*60*60,dt$datetimeUTC +   2*60*60) ]

for Id 153 , in your output why do you have 8:00 ?? shouldnt it just be 7:00 and 9:00 considering you want "2 hrs before or after" — CuriousBeing, Feb 22 '16 at 22:30
I edited to make more clear I am looking to extract all dates in between plus or -2 hours from the matching date — Wyldsoul, Feb 22 '16 at 22:44

score 4 · Accepted Answer · edited May 23 '17 at 11:48

A couple of data.table solutions for you

1. Cartesian Join

Join everything together on id, then filter out the rows you don't want.

library(data.table)
dt <- as.data.table(df)
dt_key <- as.data.table(key)

dt_join <- dt[dt_key, on = "id", allow.cartesian = TRUE][
    difftime(i.datetimeUTC, datetimeUTC, units = "hours") <= 2 &
    difftime(i.datetimeUTC, datetimeUTC, units = "hours") >= -2]

 #          datetimeUTC  id value       i.datetimeUTC
 #1: 2013-05-01 07:00:00 153  0.53 2013-05-01 09:00:00
 #2: 2013-05-01 08:00:00 153  0.46 2013-05-01 09:00:00
 #3: 2013-05-01 09:00:00 153  0.44 2013-05-01 09:00:00
 #4: 2013-05-01 10:00:00 153  0.48 2013-05-01 09:00:00
   ... etc

2. Condition on .EACHI

Making use of an answer to one of my previous questions, specify in j the condition that each by=.EACHI group has to meet in the join.

dt[dt_key,
   {
     # rows of dt within 2 hours either side of the key datetime
     idx <- difftime(i.datetimeUTC, datetimeUTC, units = "hours") <= 2 &
            difftime(i.datetimeUTC, datetimeUTC, units = "hours") >= -2
     .(datetime = datetimeUTC[idx],
       value = value[idx])
   },
   on = "id",
   by = .EACHI]

answered Feb 22 '16 at 22:45 by tospig

Thanks tospig, both of these solutions work on my sample data, I'll give them a try on the full dataset tomorrow at work. – Wyldsoul Feb 22 '16 at 23:29
@Wyldsoul - no problem. Depending on the size of your data, the `cartesian` join could use up your RAM, but if not it should run quicker. – tospig Feb 22 '16 at 23:32
Both solutions were equally efficient on my full data set (took less than 1 sec), but the cartesian join had the added benefit of assigning the unique i.datetimeUTC var for each time interval, which is useful to me. Thanks again! – Wyldsoul Feb 23 '16 at 16:20
@tospig, great answer (already upvoted). Just wanted to let you know of the recent developments in data.table, non-equi joins.. I've provided an answer. Cheers. – Arun Jul 29 '16 at 16:26
@Arun - thanks: I've been watching / using the dev versions and the non-equi join is a great feature. Thanks for implementing it. – tospig Jul 31 '16 at 01:56

score 4 · Answer 2 · answered Jul 29 '16 at 16:25

@Tospig's solution is very nice. But now, with the newly implemented non-equi joins feature in the current development version of data.table, this is quite straightforward:

require(data.table) # v1.9.7+
setDT(df)
setDT(key) ## converting data.frames to data.tables by reference
df[key, .(x.datetimeUTC, i.datetimeUTC, id, value), 
  on=.(datetimeUTC >= d1, datetimeUTC <= d2), nomatch=0L]

That's it.

Note that this performs a conditional join directly and is therefore both memory efficient (as opposed to performing a cartesian join and then filtering based on a condition) and fast (since the rows matching the given condition are obtained using a modified binary search, as opposed to the by=.EACHI looping variant shown in @tospig's answer).

See the installation instructions for devel version here.

As far as I can see, 'd1' and 'd2' isn't defined. Although trivial, perhaps you may add their creation to your answer, to make it a nice, complete canonical ;) — Henrik, Mar 02 '18 at 10:45
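
Following up on the comment above, here is a minimal sketch of how d1 and d2 might be created. The column names d1 and d2 come from the answer, but their definitions here (the key datetime minus and plus two hours) are an assumption. Adding id to the on clause, so that a window only matches rows with the same id, is also an assumption. The data is a small stand-in for the question's df and key, not the full example.

```r
library(data.table)  # the answer states v1.9.7+ for non-equi joins

# Stand-in data: ten hourly rows for a single id (assumed, not the full example)
df <- data.table(
  datetimeUTC = as.POSIXct("2013-05-01 05:00", tz = "UTC") + 3600 * 0:9,
  id = "153",
  value = c(0.53, 0.46, 0.53, 0.46, 0.44, 0.48, 0.49, 0.49, 0.51, 0.53)
)
key <- data.table(
  datetimeUTC = as.POSIXct("2013-05-01 09:00", tz = "UTC"),
  id = "153"
)

# Assumed definitions: d1/d2 are the lower/upper bounds of each 2-hour window
key[, `:=`(d1 = datetimeUTC - 2 * 3600, d2 = datetimeUTC + 2 * 3600)]

# Non-equi join: same id, and df's datetime between d1 and d2 (inclusive).
# x.* refers to columns of df, i.* to columns of key.
res <- df[key,
          .(datetimeUTC = x.datetimeUTC, keyUTC = i.datetimeUTC, id, value),
          on = .(id, datetimeUTC >= d1, datetimeUTC <= d2),
          nomatch = 0L]
```

On this stand-in data, res should hold the five rows from 07:00 to 11:00 for id 153, each tagged with the key datetime it matched.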

score 1 · Answer 3 · answered Feb 22 '16 at 22:55

With lubridate you can do:

library(lubridate)
do.call(rbind, apply(key,1, FUN=function(k) 
      df[df$id == k['id'] &
      df$datetimeUTC >= ymd_hms( k['datetimeUTC']) -hours(2) &
      df$datetimeUTC <= ymd_hms(k['datetimeUTC']) +hours(2),]))

 1: 2013-05-01 07:00:00 153  0.53
 2: 2013-05-01 08:00:00 153  0.46
 3: 2013-05-01 09:00:00 153  0.44
 4: 2013-05-01 10:00:00 153  0.48
 5: 2013-05-01 11:00:00 153  0.49
 6: 2013-11-24 12:00:00 154  0.55
 7: 2013-11-24 13:00:00 154  0.61
 8: 2013-11-24 14:00:00 154  0.70
 9: 2013-11-24 15:00:00 154  0.74
10: 2013-11-24 16:00:00 154  0.78
11: 2015-08-02 03:00:00 240  0.20
12: 2015-08-02 04:00:00 240  0.22
13: 2015-08-02 05:00:00 240  0.22
14: 2015-08-02 06:00:00 240  0.27
15: 2015-08-02 07:00:00 240  0.23
16: 2015-08-02 13:00:00 240  0.21
17: 2015-08-02 14:00:00 240  0.22
18: 2015-08-02 15:00:00 240  0.24
19: 2015-08-02 16:00:00 240  0.25
20: 2015-08-02 17:00:00 240  0.12

Thanks HubertL, this appears to work well and I'll give this a try on the full dataset tomorrow. — Wyldsoul, Feb 22 '16 at 23:30
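
One thing to note about the lubridate answer above: apply() hands each row of key to the function as a character vector, which is why the datetime has to be re-parsed with ymd_hms(). A minimal base R sketch that loops over row indices instead, so the POSIXct column is used directly, might look like this. The names half_window, pieces and gap are illustrative, and the data is a small stand-in rather than the question's full df and key.

```r
# Stand-in data: ten hourly rows for a single id (assumed, not the full example)
df <- data.frame(
  datetimeUTC = as.POSIXct("2013-05-01 05:00", tz = "UTC") + 3600 * 0:9,
  id = "153",
  value = c(0.53, 0.46, 0.53, 0.46, 0.44, 0.48, 0.49, 0.49, 0.51, 0.53),
  stringsAsFactors = FALSE
)
key <- data.frame(
  datetimeUTC = as.POSIXct("2013-05-01 09:00", tz = "UTC"),
  id = "153",
  stringsAsFactors = FALSE
)

half_window <- 2 * 60 * 60  # two hours, in seconds

# For each key row, keep the df rows with the same id whose datetime
# lies within half_window seconds of the key datetime (bounds included).
pieces <- lapply(seq_len(nrow(key)), function(i) {
  gap <- abs(as.numeric(difftime(df$datetimeUTC, key$datetimeUTC[i], units = "secs")))
  df[df$id == key$id[i] & gap <= half_window, ]
})
result <- do.call(rbind, pieces)
```

Because the loop indexes key by row number, each column keeps its own type, and no string parsing is needed inside the function.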
