5

I didn't find a solution for this common grouping problem in R:

This is my original dataset

ID  State
1   A
2   A
3   B
4   B
5   B
6   A
7   A
8   A
9   C
10  C

This should be my grouped resulting dataset

State   min(ID) max(ID)
A       1       2
B       3       5
A       6       8
C       9       10

So the idea is to sort the dataset first by the ID column (or a timestamp column). Then all connected states with no gaps should be grouped together and the min and max ID value should be returned. It's related to the rle method, but this doesn't allow the calculation of min, max values for the groups.

Any ideas?

HansHupe
  • 476
  • 3
  • 13
  • 2
    Related post: http://stackoverflow.com/questions/37809094/create-group-names-for-consecutive-values – zx8754 Sep 15 '16 at 13:32

4 Answers4

7

You could try:

library(dplyr)
df %>%
  mutate(rleid = cumsum(State != lag(State, default = ""))) %>%
  group_by(rleid) %>%
  summarise(State = first(State), min = min(ID), max = max(ID)) %>%
  select(-rleid)

Or as per mentioned by @alistaire in the comments, you can actually mutate within group_by() with the same syntax, combining the first two steps. Stealing data.table::rleid() and using summarise_all() to simplify:

df %>% 
  group_by(State, rleid = data.table::rleid(State)) %>% 
  summarise_all(funs(min, max)) %>% 
  select(-rleid)

Which gives:

## A tibble: 4 × 3
#   State   min   max
#  <fctr> <int> <int>
#1      A     1     2
#2      B     3     5
#3      A     6     8
#4      C     9    10
Steven Beaupré
  • 21,343
  • 7
  • 57
  • 77
  • You can actually `mutate` within `group_by` with the same syntax, combining the first two steps. Stealing `data.table::rleid` and using `summarise_all` to simplify: `df %>% group_by(State, rleid = data.table::rleid(State)) %>% summarise_all(funs(min, max)) %>% select(-rleid)` – alistaire Sep 15 '16 at 14:42
  • @alistaire Haven't thought about using `summarise_all()` there. Nice one. I updated the answer with your suggestion. – Steven Beaupré Sep 15 '16 at 15:08
5

Here is a method that uses the rle function in base R for the data set you provided.

# get the run length encoding
temp <- rle(df$State)

# construct the data.frame
newDF <- data.frame(State=temp$values,
                    min.ID=c(1, head(cumsum(temp$lengths) + 1, -1)),
                    max.ID=cumsum(temp$lengths))

which returns

newDF
  State min.ID max.ID
1     A      1      2
2     B      3      5
3     A      6      8
4     C      9     10

Note that rle requires a character vector rather than a factor, so I use the as.is argument below.


As @cryo111 notes in the comments below, the data set might be unordered timestamps that do not correspond to the lengths calculated in rle. For this method to work, you would need to first convert the timestamps to a date-time format, with a function like as.POSIXct, use df <- df[order(df$ID),], and then employ a slight alteration of the method above:

# get the run length encoding
temp <- rle(df$State)

# construct the data.frame
newDF <- data.frame(State=temp$values,
                    min.ID=df$ID[c(1, head(cumsum(temp$lengths) + 1, -1))],
                    max.ID=df$ID[cumsum(temp$lengths)])

data

df <- read.table(header=TRUE, as.is=TRUE, text="ID  State
1   A
2   A
3   B
4   B
5   B
6   A
7   A
8   A
9   C
10  C")
lmo
  • 37,904
  • 9
  • 56
  • 69
  • Would this work if `ID` is a timestamp column as mentioned by the OP? – cryo111 Sep 15 '16 at 14:19
  • In your solution, `min.ID` and `max.ID` are calculated via `rle` lengths. What if the `ID` column now holds (unordered) timestamps? I guess the OP then wants the min and max time stamp of the respective groups. – cryo111 Sep 15 '16 at 15:19
  • I see your point now. I've made an edit that should more or less address this issue. – lmo Sep 15 '16 at 16:12
4

An idea with data.table:

require(data.table)

dt <- fread("ID  State
1   A
            2   A
            3   B
            4   B
            5   B
            6   A
            7   A
            8   A
            9   C
            10  C")

dt[,rle := rleid(State)]
dt2<-dt[,list(min=min(ID),max=max(ID)),by=c("rle","State")]

which gives:

   rle State min max
1:   1     A   1   2
2:   2     B   3   5
3:   3     A   6   8
4:   4     C   9  10

The idea is to identify sequences with rleid and then get the min and max of IDby the tuple rle and State.

you can remove the rle column with

dt2[,rle:=NULL]

Chained:

 dt2<-dt[,list(min=min(ID),max=max(ID)),by=c("rle","State")][,rle:=NULL]

You can shorten the above code even more by using rleid inside by directly:

dt2 <- dt[, .(min=min(ID),max=max(ID)), by=.(State, rleid(State))][, rleid:=NULL]
Jaap
  • 81,064
  • 34
  • 182
  • 193
Tensibai
  • 15,557
  • 1
  • 37
  • 57
  • Thanks for the explanation, I didn't know about the rleid function – HansHupe Sep 15 '16 at 13:20
  • 1
    @HansHupe it's part of the `data.table` package, it ease a lot of things like this – Tensibai Sep 15 '16 at 13:20
  • 1
    `dt[, .(min = min(ID), max = max(ID)), by = .(State, rl = rleid(State))][, rl := NULL][]` is even shorter – Jaap Sep 15 '16 at 18:20
  • @pro I didn't wanted to go in advanced DT syntax, unsure I would really well explain it too. So feel free to edit to add it, or add it as another answer ;) – Tensibai Sep 15 '16 at 18:22
  • 1
    added it, its not different enough to justify a separate answer – Jaap Sep 15 '16 at 18:38
2

Here is another attempt using rle and aggregate from base R:

rl <- rle(df$State)
newdf <- data.frame(ID=df$ID, State=rep(1:length(rl$lengths),rl$lengths))
newdf <- aggregate(ID~State, newdf, FUN = function(x) c(minID=min(x), maxID=max(x)))
newdf$State <- rl$values

  # State ID.minID ID.maxID
# 1     A        1        2
# 2     B        3        5
# 3     A        6        8
# 4     C        9       10

data

df <- structure(list(ID = 1:10, State = c("A", "A", "B", "B", "B", 
"A", "A", "A", "C", "C")), .Names = c("ID", "State"), class = "data.frame", 
row.names = c(NA, 
    -10L))
989
  • 12,579
  • 5
  • 31
  • 53