R: new column for unique observations over time elapsed (indexing from previous)

Question

I am trying to create a new column that assigns a unique value to the observation (row) only IF the recorded observation occur after a specific time following the last observation (see data frame).

Context: I set up camera trap to observe what species visit a particular plot, every visit by a species should get a unique visitID. The actual database contains more complexity but this is the main problem I have.

new.df <- data.frame(
   species = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"), 
   visit.time = c(seq(ymd_hm('2015-01-01 00:00'), ymd_hm('2015-01-01 00:10'), by = '2 mins'), 
                  seq(ymd_hm('2015-01-01 00:00'), ymd_hm('2015-01-01 00:10'), by = '2 mins'))
      )

> new.df
   species          visit.time
1        A 2015-01-01 00:00:00
2        A 2015-01-01 00:02:00
3        A 2015-01-01 00:04:00
4        A 2015-01-01 00:06:00
5        A 2015-01-01 00:08:00
6        A 2015-01-01 00:10:00
7        B 2015-01-01 00:00:00
8        B 2015-01-01 00:02:00
9        B 2015-01-01 00:04:00
10       B 2015-01-01 00:06:00
11       B 2015-01-01 00:08:00
12       B 2015-01-01 00:10:00

I would like to create a new column called "visitID" that records an each species' visit that occured. However, I only want to assign a unique number only of the visit occurred at least 2 minutes after the previous recorded visit:

    > new.df
   species          visit.time visitID
1        A 2015-01-01 00:00:00 1
2        A 2015-01-01 00:02:00 -
3        A 2015-01-01 00:04:00 2
4        A 2015-01-01 00:06:00 -
5        A 2015-01-01 00:08:00 3
6        A 2015-01-01 00:10:00 -
7        B 2015-01-01 00:00:00 1
8        B 2015-01-01 00:02:00 -
9        B 2015-01-01 00:04:00 2
10       B 2015-01-01 00:06:00 -
11       B 2015-01-01 00:08:00 3
12       B 2015-01-01 00:10:00 -

where - is just an NA

I would usually try using dplyr:mutate with conditional terms ifelse, the problem is I do not know how to account for time elapse from the previous visit.

Please let me know if there are more details that could provide. Thanks!

All of the visits in your example data occur two minutes after the previous visit. Shouldn't they all have an ID? — , Jul 30 '19 at 06:27
That's my bad, I tried to keep the example simple so I decided to go with 2 minutes. But the respond below are all great! — PanOle, Jul 30 '19 at 09:35

score 1 · Answer 1 · answered Jul 30 '19 at 07:39

You can return the differences using diff(). Just make sure to prepend a 2 to each group of species, i.e. c(2, diff(visit.time) / 60), so that the first visit for each species always gets an ID (R will throw an error otherwise).

The only criterion you've given for visitID is that the values for each species are unique, but not that they are consecutive, so I'll assume that 1 5 6 is just as valid as 1 2 3. This simplifies things quite a bit:

library(dplyr)

df %>% 
    group_by(species) %>% 
    mutate(tdiff = c(2, diff(visit.time) / 60),
           visitID = seq_along(species),
           visitID = ifelse(tdiff >= 2, visitID, NA)
           )

Which will return the following data frame:

# A tibble: 12 x 4
# Groups:   species [2]
   species visit.time          tdiff visitID
   <fct>   <dttm>              <dbl>   <int>
 1 A       2015-01-01 00:02:10 2           1
 2 A       2015-01-01 00:03:00 0.833      NA
 3 A       2015-01-01 00:03:10 0.167      NA
 4 A       2015-01-01 00:04:00 0.833      NA
 5 A       2015-01-01 00:07:40 3.67        5
 6 A       2015-01-01 00:09:40 2           6
 7 B       2015-01-01 00:00:40 2           1
 8 B       2015-01-01 00:01:10 0.5        NA
 9 B       2015-01-01 00:04:10 3           3
10 B       2015-01-01 00:05:40 1.5        NA
11 B       2015-01-01 00:09:40 4           5
12 B       2015-01-01 00:09:50 0.167      NA

Note that I've used a modified dataset because the differences between the times in the example you provide are all == 2.

Data:

df <- structure(list(species = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 
2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), 
    visit.time = structure(c(1420070530, 1420070580, 1420070590, 
    1420070640, 1420070860, 1420070980, 1420070440, 1420070470, 
    1420070650, 1420070740, 1420070980, 1420070990), class = c("POSIXct", 
    "POSIXt"), tzone = "UTC")), class = "data.frame", row.names = c(NA, 
-12L))

In case of several small time differences accumulating to more than 2 minutes, it seems they would not detect a new visitID. — Joris C., Jul 30 '19 at 08:12
@JorisChau nothing in the OP's question indicates that this would be a problem. I think they are interested in longer intervals where there are no visits, not in accumulated differences of > 2 minutes. You might want to clarify with the OP directly. — , Jul 30 '19 at 08:22
Thank you gersht, this definitely is good enough to let me continue my work! — PanOle, Jul 30 '19 at 09:37

f.lechleitner · Accepted Answer · 2019-07-30T08:29:02.887

From your desired output it seems you want a new ID when the time difference between the current and the last recorded visit that received a new ID exceeds 2 minutes. In that case, we could use a cumulative sum that resets at a certain threshold. I've used the function from this answer: dplyr / R cumulative sum with reset

sum_reset_at <- function(thresh) {
  function(x) {
    accumulate(x, ~if_else(.x>thresh, .y, .x+.y))
  }  
}

new.df <- new.df %>%
  group_by(species) %>% # group df by species
  arrange(species, visit.time) %>% # sort the data
  mutate(
    time.elapsed = as.numeric(difftime(visit.time, lag(visit.time), units = "mins")), # calculate time difference in minutes
    time.elapsed = ifelse(is.na(time.elapsed), 0, time.elapsed), # replace NAs at first entries with 0s
    time.elapsed.cum = sum_reset_at(2)(time.elapsed), # build cumulative sum that resets once the value is greater (not greater or equal) to two
    newID = ifelse(time.elapsed.cum > 2, TRUE, FALSE), # build logical vector that marks the position where a new ID starts
    visitID = cumsum(newID) + 1, # generate visit IDs
    visitID = replace(visitID, duplicated(visitID), NA) # keep only first entry of an id, replace rest with NA
  )

Output:

> new.df
# A tibble: 12 x 6
# Groups:   species [2]
   species visit.time          time.elapsed time.elapsed.cum newID visitID
   <fct>   <dttm>                     <dbl>            <dbl> <lgl>   <dbl>
 1 A       2015-01-01 00:00:00            0                0 FALSE       1
 2 A       2015-01-01 00:02:00            2                2 FALSE      NA
 3 A       2015-01-01 00:04:00            2                4 TRUE        2
 4 A       2015-01-01 00:06:00            2                2 FALSE      NA
 5 A       2015-01-01 00:08:00            2                4 TRUE        3
 6 A       2015-01-01 00:10:00            2                2 FALSE      NA
 7 B       2015-01-01 00:00:00            0                0 FALSE       1
 8 B       2015-01-01 00:02:00            2                2 FALSE      NA
 9 B       2015-01-01 00:04:00            2                4 TRUE        2
10 B       2015-01-01 00:06:00            2                2 FALSE      NA
11 B       2015-01-01 00:08:00            2                4 TRUE        3
12 B       2015-01-01 00:10:00            2                2 FALSE      NA

So basically we are summing up the time differences until they exceed two minutes, then we reset the sum to zero. Where this cumsum is greater than two we need to add a new ID. We do this by adding a logical vector and building the cumsum of that vector (because TRUE = 1 and FALSE = 0). Lastly, we replace the duplicated IDs in the groups to get the output you specified. We can drop the columns you don't need:

> new.df %>% select(-c(time.elapsed, time.elapsed.cum, newID))
# A tibble: 12 x 3
# Groups:   species [2]
   species visit.time          visitID
   <fct>   <dttm>                <dbl>
 1 A       2015-01-01 00:00:00       1
 2 A       2015-01-01 00:02:00      NA
 3 A       2015-01-01 00:04:00       2
 4 A       2015-01-01 00:06:00      NA
 5 A       2015-01-01 00:08:00       3
 6 A       2015-01-01 00:10:00      NA
 7 B       2015-01-01 00:00:00       1
 8 B       2015-01-01 00:02:00      NA
 9 B       2015-01-01 00:04:00       2
10 B       2015-01-01 00:06:00      NA
11 B       2015-01-01 00:08:00       3
12 B       2015-01-01 00:10:00      NA

thank you very much @breeljausn I was able to adapt your code to what I needed with my real data set — PanOle, Jul 30 '19 at 09:36

R: new column for unique observations over time elapsed (indexing from previous)

2 Answers2

Data: