How to select rows with times separated by a certain interval in R

Question

I have a large data frame of camera trap observations from cameras placed at different locations every month. One observation consists of five photos triggered by one animal.

dput of the first 20 rows:

structure(list(deploymentid = structure(c(2L, 2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("B4-Wintergatter_Riedlhäng",
"I3-Wintergatter_Riedlhäng"), class = "factor"), species = structure(c(1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L), .Label = "Rotwild", class = "factor"), time = structure(c(1520900972, 
1520900972, 1520900972, 1520900972, 1520900972, 1520900982, 1520900982, 
1520900982, 1520900982, 1520900982, 1520901025, 1520901025, 1520901025, 
1520901025, 1520901025, 1520975705, 1520975705, 1520975705, 1520975705, 
1520975705), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("deploymentid", 
"species", "time"), row.names = c(NA, 20L), class = "data.frame")

For the analysis, I consider consecutive observations independent if they are at least 2 min apart. To achieve this, I computed the time difference between consecutive photos within each camera deployment, selected all times with a difference larger than two minutes, and then subset the data frame to the photos taken at those times:

1) First I used dplyr to compute the time interval to the previous photo of the same deployment. For the first observation of each deployment I arbitrarily chose 1000, a number bigger than 120, so that those observations are included in my selection later.

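library(dplyr)
deerobs_tbl <- as_tibble(Deerobs)
deerobs_gr  <- group_by(deerobs_tbl, deploymentid)
# Sort by time within each deployment
deerobs_or  <- arrange(deerobs_gr, time, .by_group = TRUE)
# Gap to the previous photo in seconds (units made explicit); 1000 marks the first photo
deerobs_2   <- mutate(deerobs_or, diff = c(1000, as.numeric(diff(time), units = "secs")))
deerobs2_df <- data.frame(deerobs_2)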

2) I guess this would also have been possible with dplyr, but plyr was easier to use. I built a data frame with only the deployment ID, the time, and the time difference to the previous picture. Then, for each deployment, I selected the times that were more than 2 min after the previous one and kept all rows with those times.

library(plyr)
deerobs_times <- data.frame(deerobs2_df$time, deerobs2_df$deploymentid, deerobs2_df$diff)
deerobs_times_apart <- ddply(deerobs_times, "deerobs2_df.deploymentid", subset, deerobs2_df.diff > 120)
deerobs_t <- deerobs_times_apart[, 1]
Deerobs_subset <- subset(deerobs2_df, deerobs2_df$time %in% deerobs_t)

The only problem is that this removes far more observations than necessary. The number of photos is reduced from more than 9000 to fewer than 3000. For example, if ten observations follow each other at intervals of 1.5 minutes, all the photos after the first are removed, although five of the observations are more than two minutes apart from each other. Is there a way to avoid this and select all observations that are more than two minutes apart?
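
A minimal sketch of the problem, using made-up timestamps (ten observations 90 seconds apart at one camera) rather than the real data. The names `times`, `gap`, and `kept` are only for illustration:

```r
# Hypothetical example: ten observations at one camera, 90 seconds apart
times <- as.POSIXct("2018-03-13 00:00:00", tz = "UTC") +
  seq(0, by = 90, length.out = 10)

# Gap to the previous observation in seconds; 1000 is the placeholder
# for the first observation, as in the code above
gap <- c(1000, as.numeric(diff(times), units = "secs"))

# Filtering on the gap to the immediately preceding observation keeps
# only the first one, because every later gap is 90 s
kept <- times[gap > 120]
length(kept)
# 1
```

Observations 1, 3, 5, 7 and 9 are 180 seconds apart from each other, so one would expect five of them to survive.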

Can you post your data using `dput` instead? Can not grab data from the image. — Gautam, Aug 15 '18 at 15:30
I added a link to a file with the data, I hope it works this way. — Biomaik, Aug 15 '18 at 15:59
Did you just post a link to your data on filedropper? Please post your data using `dput` instead. Please read [How to make a great R reproducible example?](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) for help. If you want an answer you need to help us help you. — divibisan, Aug 15 '18 at 16:02
Here's what I understand from the post: if the data has 10 observations that are 1 min apart and you find the time difference between consecutive observations (using diff etc.), then all of these observations would be discarded. However, if you were to sequentially discard the observations and recalculate the time difference each time, the third observation would have a time difference greater than 2 minutes (compared to the first, when the second is discarded) and would be retained. Is this the case? — Gautam, Aug 16 '18 at 14:38
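
The sequential procedure described in the last comment could be sketched roughly as below. This is only an illustration in base R, not code from the thread: `keep_independent` and `min_gap` are hypothetical names, and the timestamps are made up.

```r
# Hypothetical helper: keep an observation only if it is more than
# `min_gap` seconds after the last observation that was kept.
keep_independent <- function(times, min_gap = 120) {
  secs <- as.numeric(times)        # seconds since the epoch
  keep <- logical(length(secs))
  last_kept <- -Inf
  for (i in order(secs)) {         # walk through the times in ascending order
    if (secs[i] - last_kept > min_gap) {
      keep[i] <- TRUE
      last_kept <- secs[i]
    }
  }
  keep
}

# Ten made-up observations, 90 seconds apart
times <- as.POSIXct("2018-03-13 00:00:00", tz = "UTC") +
  seq(0, by = 90, length.out = 10)

which(keep_independent(times))
# 1 3 5 7 9
```

Each observation is compared with the last retained one instead of the immediately preceding one, which is why every second observation survives here.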

score 0 · Answer 1 · answered Aug 15 '18 at 18:13

If your dataset is not too large, clustering is one approach to solve this problem.

library(dplyr)

data <- structure(list(deploymentid = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L), .Label = c("B4-Wintergatter_Riedlhäng", "I3-Wintergatter_Riedlhäng"), class = "factor"), species = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "Rotwild", class = "factor"), time = structure(c(1520900972, 1520900972, 1520900972, 1520900972, 1520900972, 1520900982, 1520900982, 1520900982, 1520900982, 1520900982, 1520901025, 1520901025, 1520901025, 1520901025, 1520901025, 1520975705, 1520975705, 1520975705, 1520975705, 1520975705), class = c("POSIXct", "POSIXt"), tzone = "UTC")), .Names = c("deploymentid", "species", "time"), row.names = c(NA, 20L), class = "data.frame")

data %>%
  mutate(
    # Create a numeric vector on minute scale
    minutes    = difftime(time, min(time), units = 'min') %>% as.numeric(),
    # Cluster and group based on 2 minute height
    time_group = cutree(hclust(dist(minutes)), h = 2)
  ) %>%
  # Collapse the groups of images
  group_by(deploymentid, species, time_group) %>%
  summarise(n = n(), mean_time = mean(time))

# # A tibble: 3 x 5
# # Groups:   deploymentid, species [?]
#   deploymentid              species time_group     n mean_time          
#   <fct>                     <fct>        <int> <int> <dttm>             
# 1 B4-Wintergatter_Riedlhäng Rotwild          1     5 2018-03-13 00:30:25
# 2 I3-Wintergatter_Riedlhäng Rotwild          1    10 2018-03-13 00:29:37
# 3 I3-Wintergatter_Riedlhäng Rotwild          2     5 2018-03-13 21:15:05

Thanks a lot for your answer, although it is not exactly what I wanted. At the end you got the pictures in groups of two minutes per deployment. What I need is to retain all five pictures belonging to one observation (happening within 1 second), but only select the next observation during the same deployment if it happened at least 2 minutes later. — Biomaik, Aug 17 '18 at 11:26
@biomaik Then cluster twice. Once to group “pictures” (1 second) then again to group “observations” (2 minutes). Then you can just discard picture clusters that get assigned to the same observation. — Eric, Aug 17 '18 at 20:54
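
For comparison, the requirement in the comment above (keep all five photos of an event, but only keep an event if it comes more than two minutes after the previously kept event of the same deployment) can also be sketched without clustering. This is an assumption-laden illustration in base R: the timestamps are the four distinct times from the question's dput, the deployment labels are shortened to `"I3"` and `"B4"`, and `keep_independent`, `photos`, and `photos_subset` are hypothetical names.

```r
# Timestamps taken from the dput in the question; labels shortened
photos <- data.frame(
  deploymentid = rep(c("I3", "I3", "B4", "I3"), each = 5),
  time = as.POSIXct(
    rep(c(1520900972, 1520900982, 1520901025, 1520975705), each = 5),
    origin = "1970-01-01", tz = "UTC"
  )
)

# Hypothetical helper: `secs` must be sorted, distinct event times in seconds
keep_independent <- function(secs, min_gap = 120) {
  keep <- logical(length(secs))
  last_kept <- -Inf
  for (i in seq_along(secs)) {
    if (secs[i] - last_kept > min_gap) {
      keep[i] <- TRUE
      last_kept <- secs[i]
    }
  }
  keep
}

# Decide per deployment which events to keep, then keep all their photos
keep_row <- logical(nrow(photos))
for (idx in split(seq_len(nrow(photos)), photos$deploymentid)) {
  secs   <- as.numeric(photos$time[idx])
  events <- sort(unique(secs))               # one entry per event
  kept   <- events[keep_independent(events)]
  keep_row[idx] <- secs %in% kept            # all photos of kept events
}
photos_subset <- photos[keep_row, ]
nrow(photos_subset)
# 15
```

In this excerpt the second I3 event (10 seconds after the first) is dropped, while both remaining I3 events and the single B4 event keep all five photos. The sketch uses a strict `>` comparison; `>=` would be needed if exactly two minutes should count as independent.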

score 0 · Answer 2 · answered Aug 20 '18 at 08:15

Thanks @Eric, your ideas helped me a lot to solve the problem. So here is how it worked out in the end:

# Add a column "eventid", which is unique for each event
# (Kamera_ID does not appear in the dput excerpt in the question)
Deerobs$eventid <- as.factor(paste(Deerobs$Kamera_ID, Deerobs$time, sep = '-'))

# Group the pictures by deployment and order them
library(dplyr)
deerobs_tbl <- as_tibble(Deerobs)
deerobs_gr  <- group_by(deerobs_tbl, deploymentid)
deerobs_or  <- arrange(deerobs_gr, time, .by_group = TRUE)

# Add two minute time groups for each deployment
deerobs2 <- deerobs_or %>% mutate(
  minu = difftime(time, min(time), units = 'min') %>% as.numeric(),
  time_group_minu = cutree(hclust(dist(minu)), h = 2))

# Add a unique ID for each time group
deerobs2$twomin_periodid <- as.factor(paste(deerobs2$Kamera_ID, deerobs2$time_group_minu, sep = '-'))
# Select only the first eventid of each time group
deerobs_twominsub <- deerobs2[!duplicated(deerobs2$twomin_periodid), ]
# Select all the rows with these event IDs
Deerobs_subset <- subset(deerobs2, deerobs2$eventid %in% deerobs_twominsub$eventid)
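
One way to sanity-check a subset like this is to look at the smallest gap between distinct retained event times per deployment. As far as I can tell, hierarchical clustering alone does not guarantee that the first events of two neighboring clusters are more than two minutes apart, so a check of this kind may be worthwhile. The sketch below is an illustration only: `min_gap_by_deployment` and `subset_example` are hypothetical names, and the small data frame stands in for the real `Deerobs_subset`.

```r
# Hypothetical check: smallest gap (in seconds) between distinct
# retained event times, computed separately for each deployment
min_gap_by_deployment <- function(df) {
  sapply(
    split(as.numeric(df$time), as.character(df$deploymentid)),
    function(secs) {
      events <- sort(unique(secs))
      if (length(events) < 2) Inf else min(diff(events))
    }
  )
}

# Stand-in for the filtered data (timestamps from the question's dput)
subset_example <- data.frame(
  deploymentid = rep(c("I3", "I3", "B4"), each = 5),
  time = as.POSIXct(
    rep(c(1520900972, 1520975705, 1520901025), each = 5),
    origin = "1970-01-01", tz = "UTC"
  )
)

gaps <- min_gap_by_deployment(subset_example)
all(gaps > 120)
# TRUE
```

A deployment with a single retained event gets `Inf`, so it always passes the check.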
