
I have a set of animal locations with different sampling intervals. What I want to do is group any sequences where the sampling interval meets a certain criterion (e.g. is below a certain value). Let me illustrate with some dummy data:

start <- Sys.time()
timediff <- c(rep(5,3),20,rep(5,2))
timediff <- cumsum(timediff)

# Set up a dataframe with a couple of time values
df <- data.frame(TimeDate = start + timediff)

# Calculate the time differences between the rows
df$TimeDiff <- c(as.integer(tail(df$TimeDate,-1) - head(df$TimeDate,-1)),NA)

# Define a criterion in order to form groups
df$TimeDiffSmall <- df$TimeDiff <= 5

             TimeDate TimeDiff TimeDiffSmall
1 2016-03-15 23:11:49        5          TRUE
2 2016-03-15 23:11:54        5          TRUE
3 2016-03-15 23:11:59       20         FALSE
4 2016-03-15 23:12:19        5          TRUE
5 2016-03-15 23:12:24        5          TRUE
6 2016-03-15 23:12:29       NA            NA
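
(As an aside, I believe the same per-row differences could be computed a bit more compactly with diff(); this is just an equivalent sketch of the step above, assuming TimeDate is POSIXct as created here:)

# Should be equivalent to the head()/tail() construction above:
# diff() returns the n-1 successive gaps as a difftime, which
# as.numeric(..., units = "secs") turns into plain seconds; NA pads the last row.
df$TimeDiff <- c(as.numeric(diff(df$TimeDate), units = "secs"), NA)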

In this dummy data, rows 1:3 belong to one group, since the time difference between them is <= 5 seconds. Rows 4-6 belong to a second group, but hypothetically there could be a number of rows in between the two groups that don't belong to any group (TimeDiffSmall equal to FALSE).

Combining the information from multiple SO answers (e.g. part 1), I've created a function that solves this problem.

number.groups <- function(input){
  # part 1: numbering successive TRUE values
  input[is.na(input)] <- FALSE
  x.gr <- ifelse(x <- input == TRUE, cumsum(c(head(x, 1), tail(x, -1) - head(x, -1) == 1)),NA)
  # part 2: including last value into group
  items <- which(!is.na(x.gr))
  items.plus <- c(1,items+1)
  sel <- !(items.plus %in% items)
  sel.idx <- items.plus[sel]
  x.gr[sel.idx] <- x.gr[sel.idx-1]
  return(x.gr)
}

# Apply the function to create groups
df$Group <- number.groups(df$TimeDiffSmall)

             TimeDate TimeDiff TimeDiffSmall Group
1 2016-03-15 23:11:49        5          TRUE     1
2 2016-03-15 23:11:54        5          TRUE     1
3 2016-03-15 23:11:59       20         FALSE     1
4 2016-03-15 23:12:19        5          TRUE     2
5 2016-03-15 23:12:24        5          TRUE     2
6 2016-03-15 23:12:29       NA            NA     2
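
(Just to show how I then use the Group column: a small sketch splitting the fixes into one data.frame per group with base split(); the object name `bursts` is only for illustration.)

# One data.frame per group of consecutive fixes
bursts <- split(df, df$Group)
length(bursts)
# [1] 2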

This function actually works to solve my problem. The thing is, it seems like a crazy and rookie way to go about this. Is there a function that could solve my problem more cleanly?

Ratnanil

1 Answer


Like @thelatemail, I'd use the following to get the group IDs. It works because cumsum() will end up incrementing the group count each time it reaches an element preceded by a greater-than-5-second time interval.

df$Group <- cumsum(c(TRUE, diff(df$TimeDate) > 5))
df$Group
# [1] 1 1 1 2 2 2
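
To spell out what each piece evaluates to on the dummy data (roughly, as I'd expect it to print):

diff(df$TimeDate)                        # gaps of 5, 5, 20, 5, 5 seconds
diff(df$TimeDate) > 5                    # FALSE FALSE  TRUE FALSE FALSE
cumsum(c(TRUE, diff(df$TimeDate) > 5))   # 1 1 1 2 2 2
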
Josh O'Brien
  • Or `cumsum(c(FALSE,!(diff(df$TimeDate) <= 5)))` if you want to keep framing the selection in the way it is, rather than the way it isn't. – thelatemail Mar 15 '16 at 23:08
  • @thelatemail That's what I started out with actually, and when I saw I'd then need to add one to the result (or change the initial `FALSE` to a `TRUE`) to get group numbers starting with one, I flipped it all around to what seems the simpler incantation. – Josh O'Brien Mar 15 '16 at 23:11
  • Fair enough - it depends I suppose if the selection criteria is complex. Then negating it is easier than trying to reverse it all manually and ensuring the `&`'s and `|`'s all are correct. – thelatemail Mar 15 '16 at 23:38
  • @thelatemail You're right. Looking back at times I've used it in the past (e.g. [here](http://stackoverflow.com/questions/8171203/cumulative-sums-over-run-lengths-can-this-loop-be-vectorized/8171651#8171651)), it looks like that's what I've more often ended up using, and I think you've just identified why that's so. – Josh O'Brien Mar 15 '16 at 23:41
  • Thank you for your answer and sorry for the duplicate post. Now that the discussion is running, let me address a small issue I have with your answer (although I'm blown away by the beauty of its simplicity): if I have a time lag of > 5 seconds between rows, I would want those values to belong to no group (NA). I've slightly updated my dummy data to address this point. Using your function, rows 4 and 5 now belong to their own groups (2 and 3). Is there a way to solve this problem as elegantly? – Ratnanil Mar 16 '16 at 11:26
  • Hi. If you haven't yet figured that out, I'd suggest asking that as a new question that makes reference back to this one, while reverting this question back to its previous state. Cheers. – Josh O'Brien Mar 16 '16 at 13:35
  • ok, just did as you suggested. Here's the new question: http://stackoverflow.com/q/36039026/4139249 – Ratnanil Mar 16 '16 at 14:48