R - How to sum time differences within specific time-interval buckets?

Question

I have hundreds of thousands of records with a start time, stop time, and activity type for that timeframe. I'd like to be able to find the total amount of time that overlaps with specific, preset time intervals and group by the type of activity.

Here's what I have:

    start                      stop                      activity
 2015-12-24 12:55:00.000    2015-12-24 13:25:00.000         a
 2015-12-24 13:45:00.000    2015-12-24 13:59:00.000         b
 2015-12-24 13:55:00.000    2015-12-24 14:10:00.000         b
 2015-12-24 14:13:00.000    2015-12-24 15:05:00.000         a

And here's an example of what I'd like to have as an end result. This is shown grouped by the half hour but being able to designate any normal time interval(day, hour, half-hour, quarter-hour) is ideal:

   intervalStart               activityMinutes      activity       
 2015-12-24 12:30:00.000         5                     a
 2015-12-24 13:00:00.000         25                    a                                 
 2015-12-24 13:30:00.000         19                    b
 2015-12-24 14:00:00.000         17                    a
 2015-12-24 14:00:00.000         10                    b
 2015-12-24 14:30:00.000         30                    a
 2015-12-24 15:00:00.000         5                     a

I have a SQL query that works very well for this that I got from a previous stackoverflow question that I posted earlier this year:

How to sum the activity time that occurred within 15-minute intervals using overlapping start and stop times(SQL)(t-SQL)

However, using this query is really cumbersome in R and doesn't always work depending on the date range I'm looking at. It's also very slow and I'm hoping that a solution withing R will be faster and more reliable.

Thanks for the help and let me know if there's any other information I can provide!

EDIT - Edited to show results in 30-min interval time buckets

@Jaap I've looked at difftime but I haven't found a way to keep time differences within pre-defined intervals. — William, Dec 28 '15 at 15:33
Using dplyr, you can try something like this to group by the day and hour per activity: df %>% mutate(interval = format(start, '%Y-%m-%d %H')) %>% group_by(interval, activity) %>% summarise(activityMinutes sum(difftime(stop, start, units = 'mins'))) — Gopala, Dec 28 '15 at 15:33
This above method will group by the hour of day. If you want arbitrary intervals, you should provide more specific example input and specific desired output so someone can help you. — Gopala, Dec 28 '15 at 15:39
You could do something like `library(data.table) ; setDT(df)[, seq(start, stop, by = "min"), by = names(df)][, .N, by = .(as.IDate(V1), hour(V1), activity)]` if `start` and `stop` are of class `POSIXct`, though I can't vouch for efficiency. — David Arenburg, Dec 28 '15 at 15:46
@user3949008 I'm familiar with dplyr so I'll give it a try. i edited the post to show output bucketed in 30-minute intervals — William, Dec 28 '15 at 15:59
@DavidArenburg start and stop are in POSIXct so I'll give this a try. Is there a way to specify a time interval other than an hour? — William, Dec 28 '15 at 16:00
To specify time intervals other than hour, you will need to write some more code to convert output of 'format()' above to desired 30 minutes. For example, you can write df$minuteInterval = ifelse(as.numeric(format(start, '%M')) < 30, 00, 30). Then, you can add minuteInterval to the above 'group_by()' I showed you with dplyr. — Gopala, Dec 28 '15 at 16:04
@user3949008 Unless I'm doing something wrong, it looks like the dplyr method isn't going to give what I need. The results for row 1 in my example would give a total time of 30 minutes and put all of it into the 12:30 interval. Instead, I need the time split up across the 12:30 and 13:00 intervals — William, Dec 28 '15 at 18:52
@DavidArenburg Your solution seems to work! I haven't used data.table before but I might need to start. If you submit an answer, I'll mark it as the correct one. Thanks — William, Dec 28 '15 at 20:48
If you provide a reproducible example, I can provide working dplyr code. If I were you, I would learn dplyr vs. data.table. Latter is excellent, but dplyr is super versatile and highly intuitive - especially for newer to R people. — Gopala, Dec 30 '15 at 15:53
@user3949008 I've worked with dplyr for a while so I'm familiar with a lot of the basic things it can do. I'm learning data.table and will sometimes use it over dplyr, depending on what I'm trying to do. I do agree that it isn't quite as intuitive, though — William, Jan 08 '16 at 17:45

R - How to sum time differences within specific time-interval buckets?

0 Answers0