0

I have hundreds of thousands of records with a start time, stop time, and activity type for that timeframe. I'd like to be able to find the total amount of time that overlaps with specific, preset time intervals and group by the type of activity.

Here's what I have:

    start                      stop                      activity
 2015-12-24 12:55:00.000    2015-12-24 13:25:00.000         a
 2015-12-24 13:45:00.000    2015-12-24 13:59:00.000         b
 2015-12-24 13:55:00.000    2015-12-24 14:10:00.000         b
 2015-12-24 14:13:00.000    2015-12-24 15:05:00.000         a

And here's an example of what I'd like to have as an end result. This is shown grouped by the half hour but being able to designate any normal time interval(day, hour, half-hour, quarter-hour) is ideal:

   intervalStart               activityMinutes      activity       
 2015-12-24 12:30:00.000         5                     a
 2015-12-24 13:00:00.000         25                    a                                 
 2015-12-24 13:30:00.000         19                    b
 2015-12-24 14:00:00.000         17                    a
 2015-12-24 14:00:00.000         10                    b
 2015-12-24 14:30:00.000         30                    a
 2015-12-24 15:00:00.000         5                     a

I have a SQL query that works very well for this that I got from a previous stackoverflow question that I posted earlier this year:

How to sum the activity time that occurred within 15-minute intervals using overlapping start and stop times(SQL)(t-SQL)

However, using this query is really cumbersome in R and doesn't always work depending on the date range I'm looking at. It's also very slow and I'm hoping that a solution withing R will be faster and more reliable.

Thanks for the help and let me know if there's any other information I can provide!

EDIT - Edited to show results in 30-min interval time buckets

Community
  • 1
  • 1
William
  • 166
  • 10
  • Have a look at `?difftime` – Jaap Dec 28 '15 at 15:25
  • @Jaap I've looked at difftime but I haven't found a way to keep time differences within pre-defined intervals. – William Dec 28 '15 at 15:33
  • Using dplyr, you can try something like this to group by the day and hour per activity: df %>% mutate(interval = format(start, '%Y-%m-%d %H')) %>% group_by(interval, activity) %>% summarise(activityMinutes sum(difftime(stop, start, units = 'mins'))) – Gopala Dec 28 '15 at 15:33
  • This above method will group by the hour of day. If you want arbitrary intervals, you should provide more specific example input and specific desired output so someone can help you. – Gopala Dec 28 '15 at 15:39
  • 1
    You could do something like `library(data.table) ; setDT(df)[, seq(start, stop, by = "min"), by = names(df)][, .N, by = .(as.IDate(V1), hour(V1), activity)]` if `start` and `stop` are of class `POSIXct`, though I can't vouch for efficiency. – David Arenburg Dec 28 '15 at 15:46
  • @user3949008 I'm familiar with dplyr so I'll give it a try. i edited the post to show output bucketed in 30-minute intervals – William Dec 28 '15 at 15:59
  • @DavidArenburg start and stop are in POSIXct so I'll give this a try. Is there a way to specify a time interval other than an hour? – William Dec 28 '15 at 16:00
  • To specify time intervals other than hour, you will need to write some more code to convert output of 'format()' above to desired 30 minutes. For example, you can write df$minuteInterval = ifelse(as.numeric(format(start, '%M')) < 30, 00, 30). Then, you can add minuteInterval to the above 'group_by()' I showed you with dplyr. – Gopala Dec 28 '15 at 16:04
  • @user3949008 Unless I'm doing something wrong, it looks like the dplyr method isn't going to give what I need. The results for row 1 in my example would give a total time of 30 minutes and put all of it into the 12:30 interval. Instead, I need the time split up across the 12:30 and 13:00 intervals – William Dec 28 '15 at 18:52
  • @DavidArenburg Your solution seems to work! I haven't used data.table before but I might need to start. If you submit an answer, I'll mark it as the correct one. Thanks – William Dec 28 '15 at 20:48
  • If you provide a reproducible example, I can provide working dplyr code. If I were you, I would learn dplyr vs. data.table. Latter is excellent, but dplyr is super versatile and highly intuitive - especially for newer to R people. – Gopala Dec 30 '15 at 15:53
  • @user3949008 I've worked with dplyr for a while so I'm familiar with a lot of the basic things it can do. I'm learning data.table and will sometimes use it over dplyr, depending on what I'm trying to do. I do agree that it isn't quite as intuitive, though – William Jan 08 '16 at 17:45

0 Answers0