I have a set of animal locations with different sampling intervals. I want to group and label the sequences where the sampling interval meets a certain criterion (e.g. is below a certain value). This is a revision of this question, which was marked as a duplicate of this one. The difference in this revised question is that all values that do NOT meet the criterion should be ignored, not labeled.

Let me illustrate with some dummy data:

start <- Sys.time()
timediff <- c(rep(5,3),rep(20,3),rep(5,2))
timediff <- cumsum(timediff)

# Set up a dataframe with a couple of time values
df <- data.frame(TimeDate = start + timediff)

# For understanding purposes, I will note the time differences in a separate column
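# (converted to plain seconds so the values do not depend on difftime's automatic unit choice)
df$TimeDiff <- c(as.numeric(diff(df$TimeDate), units = "secs"), NA)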

Using @Josh O'Brien's answer, one can define a function that groups values which meet a specific criterion.

number.groups <- function(input){
  input[is.na(input)] <- FALSE # to eliminate NA
  return(head(cumsum(c(TRUE,!input)),-1))
}

# Define the criteria and apply the function
df$Group <- number.groups(df$TimeDiff <= 5)

# output
             TimeDate TimeDiff Group
1 2016-03-16 15:41:51        5     1
2 2016-03-16 15:41:56        5     1
3 2016-03-16 15:42:01       20     1
4 2016-03-16 15:42:21       20     2
5 2016-03-16 15:42:41       20     3
6 2016-03-16 15:43:01        5     4
7 2016-03-16 15:43:06        5     4
8 2016-03-16 15:43:11       NA     4

The issue here is that rows 4 and 5 are each labeled as a group of their own, rather than ignored. Is there a way to make sure that values that DO NOT belong to a group are NOT grouped (i.e. stay NA)?
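To make the behaviour concrete, here is a minimal trace of what number.groups() computes for the dummy data. The criterion vector is typed in by hand (it is what df$TimeDiff <= 5 evaluates to above), and the names crit, starts and ids are only illustrative.

```r
# Criterion for the dummy data: TimeDiff <= 5, including the trailing NA
crit <- c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, NA)
crit[is.na(crit)] <- FALSE

# A new group starts after every row whose interval fails the criterion
starts <- c(TRUE, !crit)          # one element longer than the input
ids <- head(cumsum(starts), -1)   # drop the extra element again
ids
# 1 1 1 2 3 4 4 4

# Group sizes: ids 2 and 3 contain a single row each (rows 4 and 5)
sizes <- table(ids)
sizes
```

The ids that occur only once are exactly the rows I would like to end up as NA.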

Ratnanil
  • Why is 3rd row (20) in 1st group? – Karolis Koncevičius Mar 16 '16 at 15:37
  • I'm glad you ask, I wasn't sure whether I should clarify in the question. The column "TimeDiff" describes the differences between the rows. Since the difference between rows 2 and 3 is 5 (value in row 2), row 3 belongs to group 1 – Ratnanil Mar 16 '16 at 16:03
  • @Frank Sorry for the unclear formulation. The idea was that the criterion can be defined according to the needs. In this example, I formulated it to be "df$TimeDiff <= 5". The desired output is stated in the last sentence: I would want rows 4 and 5 to belong to NO group. – Ratnanil Mar 17 '16 at 15:33

1 Answer

I think I've found a way to solve the problem. The approach is to compare each group ID with the next one and use this information to eliminate IDs that occur only once (i.e. rows that do not belong to any group). Then, renumber the remaining groups by turning them into a factor and relabeling its levels.

number.groups <- function(input){
  # Replace NAs with FALSE for cumsum() to work
  input[is.na(input)] <- FALSE
  # Make groups using cumsum()
  group <- head(cumsum(c(TRUE, !input)), -1)
  # Compare each group ID with the next one
  compare <- head(group, -1) == tail(group, -1)
  # An ID is unique if it differs from both of its neighbours
  uniques <- !(c(compare, FALSE) | c(FALSE, compare))
  # Remove unique IDs
  group[which(uniques)] <- NA
  # Convert into a factor
  group <- as.factor(group)
  # Renumber the levels as 1, 2, ... (seq_along() is also safe when no level is left)
  levels(group) <- seq_along(levels(group))
  return(group)
}

# apply the function
df$Group <- number.groups(df$TimeDiff <= 5)

# output
             TimeDate TimeDiff Group
1 2016-03-17 15:44:28        5     1
2 2016-03-17 15:44:33        5     1
3 2016-03-17 15:44:38       20     1
4 2016-03-17 15:44:58       20  <NA>
5 2016-03-17 15:45:18       20  <NA>
6 2016-03-17 15:45:38        5     2
7 2016-03-17 15:45:43        5     2
8 2016-03-17 15:45:48       NA     2
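The same idea can also be sketched with run-length encoding, which avoids the neighbour comparison. This is only an alternative sketch, not a replacement for the function above: the helper name number.groups.rle is made up for illustration, and it returns an integer vector rather than a factor.

```r
# Hypothetical helper (illustrative name), base R only
number.groups.rle <- function(input) {
  # Replace NAs with FALSE for cumsum() to work
  input[is.na(input)] <- FALSE
  # Same raw group IDs as in number.groups()
  group <- head(cumsum(c(TRUE, !input)), -1)
  # The IDs never decrease, so rle() yields exactly one run per ID
  runs <- rle(group)
  keep <- runs$lengths > 1
  # Renumber the kept groups 1, 2, ... and set single-row groups to NA
  runs$values <- ifelse(keep, cumsum(keep), NA_integer_)
  inverse.rle(runs)
}

# Criterion vector of the dummy data (df$TimeDiff <= 5), typed in by hand
res <- number.groups.rle(c(TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, NA))
res
# 1 1 1 NA NA 2 2 2
```

If a factor is preferred, the result can be wrapped in as.factor().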
Ratnanil