
I am working on an analysis of animal locations that requires the locations for each animal to be 60 minutes or more apart. Time differences in locations among animals do not matter. The data set has a list of animal IDs and the date and time of each location; an example is below.

For example, for animal 6 below, starting at the 16:19 location, the code would iterate through locations until it finds a location that is 60+ minutes from 16:19. In this case it would be the 17:36 location. Then, the code would start from the 17:36 location to find the next location (18:52) 60+ minutes, and so on. Each of the locations 60+ minutes from each other would then be extracted to a separate dataframe.

I have written a loop in R to subset the data, but I am having an issue with the code not accounting for a change in date when calculating whether locations are 60 minutes or more apart.

I have been exploring the lubridate package, which seems like it may have an easier way to address subsetting my data. However, I have not yet found a solution to subsetting the data to my specifications using this package. Any suggestions for using lubridate or an alternative method would be greatly appreciated.

Thank you in advance for your consideration.

> data(locdata)
> View(locdata)
id  date    time
6   30-Jun-09   16:19
6   30-Jun-09   16:31
6   30-Jun-09   17:36
6   30-Jun-09   17:45
6   30-Jun-09   18:00
6   30-Jun-09   18:52
6   7-Aug-10    5:30
6   7-Aug-10    5:45
6   7-Aug-10    6:00
6   7-Aug-10    6:45
23  30-Jun-09   17:15
23  30-Jun-09   17:38
23  30-Jun-09   17:56
23  30-Jun-09   20:00
23  30-Jun-09   22:19
23  18-Jul-11   16:22
23  18-Jul-11   17:50
23  18-Jul-11   18:15

The output from the example data above would look like this:

id  date    time
6   30-Jun-09   16:19
6   30-Jun-09   17:36
6   30-Jun-09   18:52
6   7-Aug-10    5:30
6   7-Aug-10    6:45
23  30-Jun-09   17:15
23  30-Jun-09   20:00
23  30-Jun-09   22:19
23  18-Jul-11   16:22
23  18-Jul-11   17:50
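
A minimal base-R sketch of the "hop" selection described in the question (the data frame and column names follow the example above). Combining date and time into a single POSIXct timestamp is what makes a change in date fall out naturally: the difference between two timestamps is correct even across midnight. Note that parsing `%d-%b-%y` depends on an English locale for the month abbreviations.

```r
# Sample rows for animal 6 from the question
locdata <- data.frame(
  id   = c(6, 6, 6, 6, 6, 6),
  date = rep("30-Jun-09", 6),
  time = c("16:19", "16:31", "17:36", "17:45", "18:00", "18:52"),
  stringsAsFactors = FALSE
)

# One timestamp per row, so date rollovers are handled automatically
locdata$dt <- as.POSIXct(paste(locdata$date, locdata$time),
                         format = "%d-%b-%y %H:%M", tz = "UTC")

hop_filter <- function(df, gap = 60 * 60) {
  keep <- 1                        # always keep the first location
  last <- df$dt[1]
  for (i in seq_len(nrow(df))[-1]) {
    if (as.numeric(difftime(df$dt[i], last, units = "secs")) >= gap) {
      keep <- c(keep, i)
      last <- df$dt[i]             # hop: measure from the kept location
    }
  }
  df[keep, ]
}

# Apply per animal, then stack the pieces back together
result <- do.call(rbind, lapply(split(locdata, locdata$id), hop_filter))
result[, c("id", "date", "time")]
```

For the sample rows above this keeps 16:19, 17:36 and 18:52, matching the desired output.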
    Post the answer that you expect from this example. – IRTFM Apr 27 '15 at 19:11
  • @BondedDust, I updated my question with an example of what I need to code to do. I would appreciate any thoughts you have to accomplishing this. – wildlifeduke Apr 27 '15 at 21:54
  • You have items that are out of order and it is not clear whether items on different dates are to be handled differently. Please, please post what the completed output should look like rather than only using natural language. – IRTFM Apr 27 '15 at 22:06
  • @BondedDust, the output would look like this based on my example: id date time 6 30-Jun-09 16:19 6 30-Jun-09 17:36 6 30-Jun-09 18:52 6 7-Aug-10 5:30 6 7-Aug-10 6:45 23 30-Jun-09 17:15 23 30-Jun-09 20:00 23 30-Jun-09 22:19 23 18-Jul-11 16:22 23 18-Jul-11 17:50 – wildlifeduke Apr 27 '15 at 22:37
  • Arrrgh. Edit your question rather than putting unformatted material in comments. – IRTFM Apr 27 '15 at 22:57
  • I apologize @BondedDust, this is my first time using Stackoverflow and replied using the wrong button. I have inserted the output sought at the end of my question. – wildlifeduke Apr 27 '15 at 23:27
  • So you want the first date-time in any given date and given subject-id and you want to then "hop" forward to successive times that are more than 60 minutes from the prior hit. This might be a good place to use `tapply` and `Reduce`. – IRTFM Apr 28 '15 at 00:47
  • Thank you @BondedDust for taking the time to put this code together. I ran the tapply function code you provided at the top, but it gave me errors saying unexpected commas and parentheses in a few places. I tried playing around with those, but it did not seem to help. I would greatly appreciate if you can figure out a way to get the output I am seeking. Possibly with lapply-split. I can email you my real dataset if that would help. Thanks again. – wildlifeduke Apr 28 '15 at 15:34

2 Answers

If I understood you correctly, I think you're looking for something along these lines:

library(dplyr)
library(lubridate)

locdata %>% 
    mutate(timestamp = dmy_hm(paste(date, time))) %>%
    group_by(id, date) %>%
    mutate(delta = timestamp - lag(timestamp))

If you haven't used dplyr or magrittr before, the syntax above may be unclear. The %>% operator passes the results of each computation to the next function, so the above code does the following:

  1. Parse the date and time into a timestamp that R understands, using lubridate
  2. Group the data by id and unique dates
  3. Within each group, calculate the duration between observations

If you want to save the output, change the first line to something like results <- locdata %>%.

Based on your updated question and revised data, I believe this works:

locdata %>% 
    mutate(timestamp = dmy_hm(paste(date, time))) %>%
    group_by(id, date) %>%
    mutate(delta = timestamp - first(timestamp),
           steps = floor(as.numeric(delta, units = "secs") / 3600), 
           change = ifelse(is.na(steps - lag(steps)), 1, steps - lag(steps))) %>%
    filter(change > 0) %>%
    select(id, date, timestamp)

Output:

Source: local data frame [10 x 3]
Groups: id, date

   id      date           timestamp
1   6 30-Jun-09 2009-06-30 16:19:00
2   6 30-Jun-09 2009-06-30 17:36:00
3   6 30-Jun-09 2009-06-30 18:52:00
4   6  7-Aug-10 2010-08-07 05:30:00
5   6  7-Aug-10 2010-08-07 06:45:00
6  23 30-Jun-09 2009-06-30 17:15:00
7  23 30-Jun-09 2009-06-30 20:00:00
8  23 30-Jun-09 2009-06-30 22:19:00
9  23 18-Jul-11 2011-07-18 16:22:00
10 23 18-Jul-11 2011-07-18 17:50:00

How it works:

  1. Create timestamp as before
  2. Group the data by id and date
  3. Compute the delta in seconds between the first timestamp in each group (i.e. the first observation of one animal in a given day) and each subsequent observation in that group, store that in a new column delta
  4. Determine which observations (if any) are more than 3600 seconds from the first one, in increments of 3600 seconds; store that in a new column steps
  5. Determine which observations are one or more steps from the first observation (and keep the first observation as well); store that in a new column change
  6. Keep only observations where change is 1 or more -- i.e. where the observation is one or more hours from the previous observation and from the first observation in the group
  7. Keep only the columns of interest

To get comfortable with how it works, drop the filter and select from the end and inspect the output:

Source: local data frame [18 x 7]
Groups: id, date

   id      date  time           timestamp      delta steps change
1   6 30-Jun-09 16:19 2009-06-30 16:19:00     0 secs     0      1
2   6 30-Jun-09 16:31 2009-06-30 16:31:00   720 secs     0      0
3   6 30-Jun-09 17:36 2009-06-30 17:36:00  4620 secs     1      1
4   6 30-Jun-09 17:45 2009-06-30 17:45:00  5160 secs     1      0
5   6 30-Jun-09 18:00 2009-06-30 18:00:00  6060 secs     1      0
6   6 30-Jun-09 18:52 2009-06-30 18:52:00  9180 secs     2      1
7   6  7-Aug-10  5:30 2010-08-07 05:30:00     0 secs     0      1
8   6  7-Aug-10  5:45 2010-08-07 05:45:00   900 secs     0      0
9   6  7-Aug-10  6:00 2010-08-07 06:00:00  1800 secs     0      0
10  6  7-Aug-10  6:45 2010-08-07 06:45:00  4500 secs     1      1
11 23 30-Jun-09 17:15 2009-06-30 17:15:00     0 secs     0      1
12 23 30-Jun-09 17:38 2009-06-30 17:38:00  1380 secs     0      0
13 23 30-Jun-09 17:56 2009-06-30 17:56:00  2460 secs     0      0
14 23 30-Jun-09 20:00 2009-06-30 20:00:00  9900 secs     2      2
15 23 30-Jun-09 22:19 2009-06-30 22:19:00 18240 secs     5      3
16 23 18-Jul-11 16:22 2011-07-18 16:22:00     0 secs     0      1
17 23 18-Jul-11 17:50 2011-07-18 17:50:00  5280 secs     1      1
18 23 18-Jul-11 18:15 2011-07-18 18:15:00  6780 secs     1      0
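
One caveat, with a sketch of an alternative: the steps/change bucketing above keeps rows by whole-hour bins measured from the first observation in the group, which can admit a row less than 60 minutes after the previously kept one (e.g. times 0:00, 1:30, 2:10 fall in bins 0, 1, 2 and are all kept, yet 2:10 is only 40 minutes after 1:30). The hypothetical helper below implements the strict rule instead, keeping a row only when it is 60+ minutes after the most recently kept row:

```r
hop_keep <- function(ts, gap = 3600) {
  # Running "time of the last kept row", carried forward with Reduce
  last_kept <- Reduce(function(last, cur)
                        if (cur - last >= gap) cur else last,
                      as.numeric(ts), accumulate = TRUE)
  # A row is kept exactly when it updates that running value
  as.numeric(ts) == last_kept
}

# It drops into the pipeline above in place of the steps/change logic:
#   locdata %>%
#       mutate(timestamp = dmy_hm(paste(date, time))) %>%
#       group_by(id) %>%
#       filter(hop_keep(timestamp)) %>%
#       select(id, date, time)
```

Grouping by id alone is enough here, because the full timestamps already account for changes in date.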
Ajar
  • Thank you very much @Ajar. Your code is close to what I am thinking, but I did not explain my goal in enough detail. I updated my question with an example of what I am trying to do, using animal (id) 6. The difference is that instead of just the time between each consecutive locations, I need to figure out the next location that is 60+ minutes from the 'start' location, save that second location, and then find the third location 60+ minutes from the second location, and so on. Any help would be greatly appreciated! Thanks again. – wildlifeduke Apr 27 '15 at 21:56
  • Thank you very much for providing this nice code. I tried applying it to my real data, but am repeatedly running into an error at the "steps" section of the code. The error says the following: Error in Ops.POSIXt(c(0, 900, 1800, 2700, 3600, 4560, 5460, 6300, 7200, : '/' not defined for "POSIXt" objects. I did a web search but could not find a solution to the error. Any thoughts you have would be appreciated. Thanks again. – wildlifeduke May 01 '15 at 05:30
  • Are your date/time formats consistent throughout your data? The errors you reported with my code and @BondedDust's code suggest to me that you have inconsistent date formatting, and that's causing errors when you try to convert the character date strings to timestamps. – Ajar May 01 '15 at 14:30

I managed to build a function using tapply that selects the proper times and have unpacked it in a couple of different versions, although I've not yet assembled it in a form that matches your suggested output. Thinking about it, I'm wondering if it may be easier to get the proper form using lapply-split:

 tapply(dat$d_time, list(dat$id, dat$date),
        function(dt) {
          Reduce(function(x, y) {
                   if (as.numeric(y) - as.numeric(tail(x, 1)) < 60*60) {
                     x
                   } else {
                     c(x, y)   # append y as the next kept location
                   }
                 },
                 dt,
                 init = dt[1])
        })
 #------------
   18-Jul-11 30-Jun-09 7-Aug-10 
6  NULL      Numeric,3 Numeric,2
23 Numeric,2 Numeric,3 NULL    

# c( ) removes the dimensions and unfortunately the INDEX items
c(tapply(dat$d_time, list(dat$id, dat$date), function(dt) Reduce( function(x,y) if(as.numeric(y)-as.numeric(tail(x,1)) < 60*60){ x } else {c(x,y)} , dt, init=dt[1])))
[[1]]
NULL

[[2]]
[1] "2011-07-18 16:22:00 PDT" "2011-07-18 17:50:00 PDT"

[[3]]
[1] "2009-06-30 16:19:00 PDT" "2009-06-30 17:36:00 PDT"
[3] "2009-06-30 18:52:00 PDT"

[[4]]
[1] "2009-06-30 17:15:00 PDT" "2009-06-30 20:00:00 PDT"
[3] "2009-06-30 22:19:00 PDT"

[[5]]
[1] "2010-08-07 05:30:00 PDT" "2010-08-07 06:45:00 PDT"

[[6]]
NULL

# unlist does something similar
unlist(tapply(dat$d_time, list(dat$id, dat$date), function(dt) Reduce( function(x,y) if(as.numeric(y)-as.numeric(tail(x,1)) < 60*60){ x } else {c(x,y)} , dt, init=dt[1])))
 [1] 1311031320 1311036600 1246403940 1246408560 1246413120 1246407300
 [7] 1246417200 1246425540 1281184200 1281188700

# It's possible to restore the date-time class.
 > as.POSIXct(unlist(tapply(dat$d_time, 
                            list(dat$id, dat$date), 
                            function(dt) Reduce( function(x,y) if(as.numeric(y)-as.numeric(tail(x,1)) < 60*60){ x } else {c(x,y)} , dt, init=dt[1]))) , origin="1970-01-01")

 [1] "2011-07-18 16:22:00 PDT" "2011-07-18 17:50:00 PDT"
 [3] "2009-06-30 16:19:00 PDT" "2009-06-30 17:36:00 PDT"
 [5] "2009-06-30 18:52:00 PDT" "2009-06-30 17:15:00 PDT"
 [7] "2009-06-30 20:00:00 PDT" "2009-06-30 22:19:00 PDT"
 [9] "2010-08-07 05:30:00 PDT" "2010-08-07 06:45:00 PDT"

 # This keeps the INDEX values as row and column names
 as.data.frame( tapply(dat$d_time, list(dat$id, dat$date), function(dt) Reduce( function(x,y) if(as.numeric(y)-as.numeric(tail(x,1)) < 60*60){ x } else {c(x,y)} , dt, init=dt[1])) )
                18-Jul-11                          30-Jun-09
6                    NULL 1246403940, 1246408560, 1246413120
23 1311031320, 1311036600 1246407300, 1246417200, 1246425540
                 7-Aug-10
6  1281184200, 1281188700
23                   NULL
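
The lapply-split form mentioned above can be sketched like this, assembling the result back into a data frame in the question's shape (this assumes, as in the code above, that `dat` has an id column and a POSIXct column d_time; splitting on id alone also handles the date change, since d_time is a full timestamp):

```r
# Small example: animal 6's first day, already parsed to POSIXct
dat <- data.frame(id = c(6, 6, 6, 6),
                  d_time = as.POSIXct(c("2009-06-30 16:19",
                                        "2009-06-30 16:31",
                                        "2009-06-30 17:36",
                                        "2009-06-30 18:52"), tz = "UTC"))

picked <- lapply(split(dat, dat$id), function(g) {
  # Same Reduce as above: accumulate the kept timestamps for one animal
  keep <- Reduce(function(x, y)
                   if (as.numeric(y) - as.numeric(tail(x, 1)) < 60*60)
                     x
                   else
                     c(x, y),
                 g$d_time, init = g$d_time[1])
  g[g$d_time %in% keep, ]   # subset the rows whose times were kept
})

result <- do.call(rbind, picked)
```

For this example `result` holds the 16:19, 17:36 and 18:52 rows.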
IRTFM