
I frequently have to average time series data within given time intervals ('events'), basically as has been asked here.

As suggested in the answers, I use an SQL statement on my data in 'long' format. Here is an example:

#create dummy data frame
set.seed(1)
data <- data.frame(
  date = seq(from = as.POSIXct("2014-01-01 00:00"),
             to = as.POSIXct("2014-01-31 23:00"),
             by = 300),
  A = runif(8917),
  B = runif(8917),
  C = runif(8917),
  D = runif(8917)
)

#convert to long format (gather() comes from tidyr; dplyr provides the %>% pipe)
require(dplyr)
require(tidyr)
data <- data %>%
  gather(class, value, A:D)

# create dummy events
events <- data.frame(
  id = c("blue","red","green","yellow"),
  start = as.POSIXct(c("2014-01-03 13:00",
                       "2014-01-12 08:00",
                       "2014-01-18 10:00",
                       "2014-01-27 23:00")),
  stop = as.POSIXct(c("2014-01-03 19:00",
                      "2014-01-13 17:00",
                      "2014-01-20 10:00",
                      "2014-01-28 20:00"))
)


#average value within events, grouped by class
require(sqldf)
results <- sqldf("
     SELECT x.id, y.class, avg(y.value) AS mean 
     FROM events as x, data as y 
     WHERE y.date between x.start and x.stop 
     GROUP BY x.id, y.class
")

which gives the desired output:

       id class      mean
1    blue     A 0.4879129
2    blue     B 0.4945888
3    blue     C 0.5312504
4    blue     D 0.4968260
5   green     A 0.5235671
6   green     B 0.5030602
7   green     C 0.5071219
8   green     D 0.5002010
9     red     A 0.5122966
10    red     B 0.4767966
11    red     C 0.5032387
12    red     D 0.5018389
13 yellow     A 0.4727868
14 yellow     B 0.4626688
15 yellow     C 0.4930207
16 yellow     D 0.5184966

However, my real data is huge (the long format can reach several million rows), so the SQL operation takes quite some time.

Are there more efficient ways to do this operation? I've stumbled across data.table::foverlaps, which is described as an 'overlap join', but I don't fully understand whether this is what I need.

If there were an efficient way to add an 'event' column to the data, indicating for each row (date) which event it falls into, then I could do a grouped summary with dplyr instead of the SQL statement. But I don't know how to do this efficiently...
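
What I have in mind is roughly the following (an untested sketch; it assumes the events never overlap each other, and the plain loop over events is probably not the efficient solution I am looking for):

#tag each row with the event it falls into (NA = outside all events)
data$event <- NA_character_
for (i in seq_len(nrow(events))) {
  in_event <- data$date >= events$start[i] & data$date <= events$stop[i]
  data$event[in_event] <- as.character(events$id[i])
}

#then a grouped summary with dplyr, dropping rows outside any event
data %>%
  filter(!is.na(event)) %>%
  group_by(event, class) %>%
  summarise(mean = mean(value))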

Any suggestions from the experts would be appreciated.

Update

As suggested in the comments, I added the creation of an index to my SQL statement. Unfortunately this did not speed things up for one of my large real-world problems; the calculation still took ~40 minutes to run.
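
For reference, the indexed call was roughly the following (adapted from G. Grothendieck's comment below; the exact index definition I ended up using may have differed):

#same query as before, but with an index created first
results <- sqldf(c(
  "CREATE INDEX a1 ON data(class, value)",
  "SELECT x.id, y.class, avg(y.value) AS mean
   FROM events AS x, main.data AS y
   WHERE y.date BETWEEN x.start AND x.stop
   GROUP BY x.id, y.class"
))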

Then I copy-pasted the data.table solution, kindly offered by David, and was impressed to see that it runs in less than 1 second on exactly the same real-world dataset.

I still don't understand how and why it does what it does, but my motivation to spend some time on learning the data.table syntax has certainly increased a lot. Thanks again for that!

user3460194
  • Did you try using `data.table::foverlaps`? You can check its documentation or search over SO. There are many similar Q/A illustrating how to use it in different situations. – David Arenburg Aug 05 '15 at 09:12
  • Have you looked at `?foverlaps`? If so, what don't you understand? – Arun Aug 05 '15 at 09:21
  • I did look at it, but I haven't tried using it, because I was not sure if it is what I need. To be honest, I am a bit intimidated by the data.table syntax, I am not very experienced in coding... But if this is the function I need, then I will try and understand it. Thanks. – user3460194 Aug 05 '15 at 09:24
  • You should probably stick to dplyr / SQL if you're intimidated. Good luck! – Arun Aug 05 '15 at 09:25
  • Try adding an index: `sqldf(c("CREATE INDEX a1 ON data(class, value)", "SELECT x.id, y.class, avg(y.value) AS mean FROM events as x, main.data as y WHERE y.date between x.start and x.stop GROUP BY x.id, y.class"))` – G. Grothendieck Aug 05 '15 at 09:47
  • This sounds promising and easy enough for me to implement :-) Thanks, I'll try later. – user3460194 Aug 05 '15 at 09:54

2 Answers


Here's a possible data.table::foverlaps solution:

library(data.table)
setDT(data)[, `:=`(start = date, stop = date)]
setkey(setDT(events), start, stop)
foverlaps(data, events, nomatch = 0L)[, .(Mean = mean(value)), keyby = .(id, class)]
#         id class      Mean
#  1:   blue     A 0.4879129
#  2:   blue     B 0.4945888
#  3:   blue     C 0.5312504
#  4:   blue     D 0.4968260
#  5:  green     A 0.5235671
#  6:  green     B 0.5030602
#  7:  green     C 0.5071219
#  8:  green     D 0.5002010
#  9:    red     A 0.5122966
# 10:    red     B 0.4767966
# 11:    red     C 0.5032387
# 12:    red     D 0.5018389
# 13: yellow     A 0.4727868
# 14: yellow     B 0.4626688
# 15: yellow     C 0.4930207
# 16: yellow     D 0.5184966

The logic seems pretty straightforward to me:

  1. Set start and stop columns within data to overlap against.
  2. Key the events data set by the same columns.
  3. Run foverlaps and remove unmatched intervals (nomatch = 0L).
  4. Calculate mean(value) by id and class.
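
A side note on which columns get matched (my reading of ?foverlaps, not part of the original answer): since data is not keyed, foverlaps falls back to the key of events for the join columns, so the call above should be equivalent to this more explicit form:

#equivalent call, spelling out the interval columns on both sides
foverlaps(data, events,
          by.x = c("start", "stop"),  # the helper columns added to 'data' above
          by.y = c("start", "stop"),  # the keyed columns of 'events'
          nomatch = 0L)[, .(Mean = mean(value)), keyby = .(id, class)]
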
David Arenburg
  • I guess copy/paste-ing wouldn't be as intimidating as investing time in going thro' the examples in `?foverlaps`. I'm pretty sure there are also similar `foverlaps()` Q/A here on SO... – Arun Aug 05 '15 at 11:05
  • @Arun so you think it's not worth posting this answer here? – David Arenburg Aug 05 '15 at 11:07
  • @David: I think it is well worth posting it and I thank you for taking the time to do so. – user3460194 Aug 05 '15 at 11:46

You may be best offloading the task completely to a database. Have a look at the RSQLite package. If your data are really large, store them in a database like SQLite and let the db do the subsetting and grouping; this should improve the speed of your task. I've written a couple of posts that may help, one on writing to SQLite and one that includes a section on reading from SQLite.
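
A minimal sketch of that route (my illustration, not from the original answer; the file and table names are placeholders, and depending on the RSQLite version the POSIXct columns are stored as numeric seconds or ISO text, either of which keeps the BETWEEN comparison consistent as long as both tables are written the same way):

#push both tables into an SQLite file and run the same query there
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), "timeseries.sqlite")
dbWriteTable(con, "data", data, overwrite = TRUE)
dbWriteTable(con, "events", events, overwrite = TRUE)

results <- dbGetQuery(con, "
  SELECT x.id, y.class, avg(y.value) AS mean
  FROM events AS x, data AS y
  WHERE y.date BETWEEN x.start AND x.stop
  GROUP BY x.id, y.class
")
dbDisconnect(con)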

One reason you may not want to do this is if you are repeating the task on lots of datasets, as the speed improvement in the query will be lost to the time taken to write your data to the db.

MikeRSpencer
  • In what way will this be different from the `sqldf` solution OP has already provided? Also, why would this be faster? Your posts talk about reading and writing, but not about computing overlaps over interval ranges. I'm quite confused as to how this answers the Q. – Arun Aug 05 '15 at 09:38
  • Thanks for clarifying the downvote Arun, it's good to know why. I have no benchmarks comparing sqldf and SQLite, so you may be right - perhaps there is no speed difference. From my own work I've found a database (generally SQLite) quicker and able to handle bigger data on subsetting tasks than base R. I didn't talk about computing overlaps as the SQL the OP had could be used as is within a dbSendQuery() command. – MikeRSpencer Aug 05 '15 at 09:44