8

This looks like an easy task, but I can't figure out a simpler way. I have the vector x below and need to create group names for runs of consecutive identical values. My attempt uses rle; are there better ideas?

# data
x <- c(1,1,1,2,2,2,3,2,2,1,1)

# make groups
rep(paste0("Group_", 1:length(rle(x)$lengths)), rle(x)$lengths)
# [1] "Group_1" "Group_1" "Group_1" "Group_2" "Group_2" "Group_2" "Group_3" "Group_4"
# [9] "Group_4" "Group_5" "Group_5"
zx8754
  • Why not use paste directly? `paste0('groupe_', c(1,1,1,2,2,2,3,2,2,1,1))` – Mamoun Benghezal Jun 14 '16 at 10:18
  • Because the last two groups would be 2 and 1 instead of 4 and 5 if you paste directly. – Sotos Jun 14 '16 at 10:18
  • @MamounBenghezal please check the expected output, first `1` is a `Group_1`, and last `1` is a `Group_5` – zx8754 Jun 14 '16 at 10:19
  • Nice attempt. A key line in the source code of `rle` makes use of `diff` as @Roland did below. – Joseph Wood Jun 14 '16 at 13:12
  • But.. having done that, how do you map these `Group_x` names to the actual values & run lengths? That is, what's the point of this exercise? – Carl Witthoft Jun 14 '16 at 14:07
  • @CarlWitthoft The names are in the same order as the values, so it is a direct map, i.e.: `names(x) <- myGroups`. My actual data is a data.frame, so I can apply the same approach and create a `Group` column for aggregate functions down the line. – zx8754 Jun 14 '16 at 14:12
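
The mapping discussed in the last two comments could be sketched as follows. This is only an illustration in base R: the names `runs`, `df`, `counts` and the column names are hypothetical, and the real data.frame mentioned in the comment is not shown in the question.

```r
x <- c(1, 1, 1, 2, 2, 2, 3, 2, 2, 1, 1)

# Compute the run-length encoding once and reuse it
r <- rle(x)
groups <- paste0("Group_", seq_along(r$lengths))

# One row per run: group name, the repeated value, and the run length
runs <- data.frame(Group = groups, value = r$values, length = r$lengths)

# One row per element: attach the group label as a column
df <- data.frame(x = x, Group = rep(groups, r$lengths))

# Example aggregate down the line: number of rows in each group
counts <- aggregate(x ~ Group, data = df, FUN = length)
```

Storing the result of `rle(x)` also avoids calling it twice, as the one-liner in the question does.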

4 Answers

11

Using rleid from data.table,

library(data.table)

rleid(x, prefix = "Group_")
#[1] "Group_1" "Group_1" "Group_1" "Group_2" "Group_2" "Group_2" "Group_3" "Group_4" "Group_4" "Group_5" "Group_5"
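
Since the asker's real data is a data.frame, the same labels could also be used as a grouping column. The following is a minimal sketch assuming the data.table package is installed; the names `DT`, `Group`, `value`, `n` and `run_summary` are made up for illustration.

```r
library(data.table)

x <- c(1, 1, 1, 2, 2, 2, 3, 2, 2, 1, 1)
DT <- data.table(x = x)

# Add the run label as a column by reference
DT[, Group := rleid(x, prefix = "Group_")]

# Summarise each run: its value and how many rows it spans
run_summary <- DT[, .(value = x[1], n = .N), by = Group]
```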
zx8754
Sotos
10

Using diff and cumsum:

paste0("Group_", cumsum(c(1, diff(x) != 0)))
#[1] "Group_1" "Group_1" "Group_1" "Group_2" "Group_2" "Group_2" "Group_3" "Group_4" "Group_4" "Group_5" "Group_5"

(If your values are floating point values, you might have to avoid != and use a tolerance instead.)

Roland
  • If they might not be numeric - `paste0("Group_", cumsum(c(TRUE, head(x,-1)!=tail(x,-1))))` – thelatemail Jun 14 '16 at 10:33
  • My numbers have no floating points, so `!=` should be OK, but what do you mean by tolerance? – zx8754 Jun 14 '16 at 10:39
  • `abs(diff(x)) < tol` with `tol` based on `help(".Machine")`. – Roland Jun 14 '16 at 10:40
  • Nice - I'm guessing this is faster than calling `rle(x)` and processing its output. On the other hand, I would want to know how to map the group names to the runs, in which case you might as well use `rle(x)$lengths`. – Carl Witthoft Jun 14 '16 at 14:08
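
A tolerance-based variant for floating point data might look like the sketch below. The choice `tol <- sqrt(.Machine$double.eps)` is an assumption (it matches the default tolerance of `all.equal`), not something stated in the answer. Note that `abs(diff(x)) < tol` tests for "equal within tolerance", so a new group starts where the opposite holds, i.e. `abs(diff(x)) > tol`.

```r
# 0.1 + 0.2 is not exactly 0.3 in floating point arithmetic
x <- c(0.1 + 0.2, 0.3, 0.3, 1.5, 1.5)

# Exact comparison splits the first two (nearly equal) values apart
exact <- paste0("Group_", cumsum(c(1, diff(x) != 0)))

# Assumed tolerance; adjust to the scale of your data
tol <- sqrt(.Machine$double.eps)

# A new group starts wherever the change exceeds the tolerance
starts <- c(TRUE, abs(diff(x)) > tol)
tolerant <- paste0("Group_", cumsum(starts))
```

Here `exact` puts the first element in its own group, while `tolerant` treats the first three values as one run.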
3

Using cumsum but not relying on the data being numeric:

paste0("Group_", 1 + c(0, cumsum(x[-length(x)] != x[-1])))
#[1] "Group_1" "Group_1" "Group_1" "Group_2" "Group_2" "Group_2" "Group_3" "Group_4" "Group_4" "Group_5" "Group_5"
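
To illustrate the non-numeric case, here is the same expression applied to a character vector (the example data is made up). One caveat, as far as I can tell: if the vector contains `NA`, the comparison yields `NA` and `cumsum` propagates it, so missing values would need to be handled first.

```r
x <- c("a", "a", "b", "b", "a", "c", "c")

# TRUE wherever an element differs from its predecessor
changed <- x[-length(x)] != x[-1]

# The first element always opens group 1; each change opens the next group
groups <- paste0("Group_", 1 + c(0, cumsum(changed)))
```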
2

group() from groupdata2 can create groups from a list of group starting points, using the l_starts method. By setting n to auto, it automatically finds group starts:

x <- c(1,1,1,2,2,2,3,2,2,1,1)
groupdata2::group(x, n = "auto", method = "l_starts")

## # A tibble: 11 x 2
## # Groups:   .groups [5]
##     data .groups
##    <dbl> <fct>  
##  1     1 1      
##  2     1 1      
##  3     1 1      
##  4     2 2      
##  5     2 2      
##  6     2 2      
##  7     3 3      
##  8     2 4      
##  9     2 4      
## 10     1 5      
## 11     1 5     

There's also the differs_from_previous() function which finds values, or indices of values, that differ from the previous value by some threshold(s).

# The values to start groups at
groupdata2::differs_from_previous(x, threshold = 1,
                                  direction = "both")
## [1] 2 3 2 1

# The indices to start groups at
groupdata2::differs_from_previous(x, threshold = 1,
                                  direction = "both",
                                  return_index = TRUE)
## [1] 4 7 8 10
ludvigolsen