0

I have a dataset that I am processing in R. I would like to group occurences of a category over time, indicating the order at which each group appears.

Data is grouped by "id", sampled by "time", and has a "category" label (low, high).

#make example data
id <- c("a", "a", "a", "a", "a", "b", "b", "b", "b", "b")
time <- seq.POSIXt(as.POSIXct("10/24/16 21:05", format="%m/%d/%y %H:%M", tz="America/Chicago"), by="5 min", length.out=10)
category <- c("low", "high", "high", "low", "low", "low", "high", "high", "low", "low")
dat<-data.frame(id, time, category)
> dat
   id                time category
1   a 2016-10-24 21:05:00      low
2   a 2016-10-24 21:10:00     high
3   a 2016-10-24 21:15:00     high
4   a 2016-10-24 21:20:00      low
5   a 2016-10-24 21:25:00      low
6   b 2016-10-24 21:30:00      low
7   b 2016-10-24 21:35:00     high
8   b 2016-10-24 21:40:00     high
9   b 2016-10-24 21:45:00      low
10  b 2016-10-24 21:50:00      low

I want to create a variable "group" that marks the time-group that each category belongs to, such that: If category == category at time X and time X+1, these are in the same group If category != category at time X and time X+1, the group ends

Groups are ordered over time, such that the first occurence of a given "category" group is 1, and the next is 2.

This is different than a sequence that counts the number of occurrences of each category label over time. While I need the "group" values to be sequenced, I need the value to repeat within each "group" of sequential "category".

#example data of what I want 
dat$group <- c(1, 1, 1, 2, 2, 1, 1, 1, 2, 2) #this is the variable I want 
dat$seq <- with(dat, ave(as.character(category), category, FUN = seq_along)) #count sequence variable, which is different than what I'm after because it does not repeat within sequential categories
> dat
   id                time category group seq
1   a 2016-10-24 21:05:00      low     1   1
2   a 2016-10-24 21:10:00     high     1   1
3   a 2016-10-24 21:15:00     high     1   2
4   a 2016-10-24 21:20:00      low     2   2
5   a 2016-10-24 21:25:00      low     2   3
6   b 2016-10-24 21:30:00      low     1   4
7   b 2016-10-24 21:35:00     high     1   3
8   b 2016-10-24 21:40:00     high     1   4
9   b 2016-10-24 21:45:00      low     2   5
10  b 2016-10-24 21:50:00      low     2   6

Basically, the idea is that "group" is an event, which can take place over a varying length of time. But even if it varies in length, it's still the same event. So you have the first event, second event, etc.

I have searched online, but not seen a post that matches the question. If I have overlooked a previous post, links to relevant posts are welcome.

Thanks in advance for any help!

Edited, 12/14/2016 for clarity and to try and garner interest.

griffmer
  • 357
  • 2
  • 9

2 Answers2

1

With base R:

dat$episode <- with(dat, ave(as.character(category), category, FUN = seq_along))

Or with rowid from the development version of data.table:

library(data.table)
setDT(dat)[, episode := rowid(category)]

Or with dplyr:

library(dplyr)
dat %>% 
  group_by(category) %>% 
  mutate(episode = row_number())

All will give the desired end result (output from dplyr is shown):

                  time category episode
                <dttm>   <fctr>   <int>
1  2016-10-24 21:05:00      low       1
2  2016-10-24 21:10:00     high       1
3  2016-10-24 21:15:00     high       2
4  2016-10-24 21:20:00      low       2
5  2016-10-24 21:25:00      low       3
6  2016-10-24 21:30:00   normal       1
7  2016-10-24 21:35:00     high       3
8  2016-10-24 21:40:00     high       4
9  2016-10-24 21:45:00      low       4
10 2016-10-24 21:50:00     high       5
h3rm4n
  • 4,126
  • 15
  • 21
  • You can also do this with the main branch of `data.table` by doing `1:.N` (per my answer below), so I don't think it's necessary to get the development version. – Alexey Shiklomanov Nov 14 '16 at 20:03
  • Thanks for the great answer and elegant use of dplyr! This is close, but a sequence is not exactly what I'm after. I've edited my question to display the sequence variable next to the variable I'm trying to compute to highlight the difference. Also, edited question for clarity. Thanks again for your help! – griffmer Nov 15 '16 at 04:45
0

I'm not 100% sure this is what you're after, but assuming I understood your question, you can accomplish this pretty simply and elegantly using the data.table package.

library(data.table)
setDT(dat)
dat[, episode := 1:.N, by = category]

#                    time category episode
#  1: 2016-10-24 21:05:00      low       1
#  2: 2016-10-24 21:10:00     high       1
#  3: 2016-10-24 21:15:00     high       2
#  4: 2016-10-24 21:20:00      low       2
#  5: 2016-10-24 21:25:00      low       3
#  6: 2016-10-24 21:30:00   normal       1
#  7: 2016-10-24 21:35:00     high       3
#  8: 2016-10-24 21:40:00     high       4
#  9: 2016-10-24 21:45:00      low       4
# 10: 2016-10-24 21:50:00     high       5

Explanation: The .N is a special variable containing the number of counts in each group (as set by the by). The := is special data.table syntax for assigning values.

Alexey Shiklomanov
  • 1,592
  • 13
  • 23
  • Thanks for the great answer! This is close, but a sequence is not exactly what I'm after. I've edited my question to display the sequence variable next to the variable I'm trying to compute to highlight the difference. Also, edited question for clarity. Thanks again for your help! – griffmer Nov 15 '16 at 04:45
  • Hmm, I see. You may be able to achieve this with some cleverness via the `duplicated` function. If I come up with a full solution, I'll post it. – Alexey Shiklomanov Nov 15 '16 at 04:48