What is an efficient way to create a sequence of numbers that increments for each change in a group variable? As a toy example, using the data frame below, I would like a new variable, "Value", to take on the values c(1,1,1,2,2,3,3,4). Note that even though 48 repeats itself, "Value" still increases as I'm only concerned with a change in the sequence.

df <- read.table(textConnection(
  'Group 
  48 
  48
  48
  56
  56
  48
  48
  14'), header = TRUE)

One way to do this is

df$Value <- 1
for (i in 2:nrow(df)) {
  if (df$Group[i] == df$Group[i - 1]) {
    df$Value[i] <- df$Value[i - 1]
  } else {
    df$Value[i] <- df$Value[i - 1] + 1
  }
}

but this is very slow. My actual dataset has several million observations.

Note: I had a difficult time wording the title of this question so please change it if you'd like.

Remy M

4 Answers

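We could also repurpose rle(): run-length encode the column, replace each run's value with its position in the sequence of runs, and then invert the encoding.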

r <- rle(df$Group)
r$values <- seq_along(r$lengths)
inverse.rle(r)
# [1] 1 1 1 2 2 3 3 4
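
To see why this works, it can help to look at the intermediate pieces. The following is only a sketch: the vector x is a stand-in for df$Group, and run_id and run_id_rep are names made up for illustration. It also shows that expanding the run indices with rep() should give the same result as inverse.rle().

```r
x <- c(48, 48, 48, 56, 56, 48, 48, 14)  # stand-in for df$Group

r <- rle(x)
r$lengths  # length of each run of identical neighbours
r$values   # the value carried by each run

# Replace each run's value with its position, then expand back out
r$values <- seq_along(r$lengths)
run_id <- inverse.rle(r)

# Equivalent formulation: repeat each run index by its run length
run_id_rep <- rep(seq_along(r$lengths), r$lengths)
```

Both vectors number the runs in order of appearance, so a value that reappears later (48 here) gets a new number.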

Data

df <- structure(list(Group = c(48L, 48L, 48L, 56L, 56L, 48L, 48L, 14L
)), class = "data.frame", row.names = c(NA, -8L))
jay.sf

Inspired by this post: https://stackoverflow.com/a/44512144/3772141

Just do this:

library(dplyr)

df %>%
  mutate(Value = cumsum(Group != lag(Group) | row_number() == 1))

Result:

# Group Value
#    48     1
#    48     1
#    48     1
#    56     2
#    56     2
#    48     3
#    48     3
#    14     4

How it works:

  1. Compare Group with Group from the previous row. If it differs, the result is TRUE, indicating that a new run begins here: Group != lag(Group)
  2. The first element that lag() returns is NA, so the comparison is NA for the first row. The first row should always count as the start of a run, so it is forced to TRUE: | row_number() == 1
  3. TRUE and FALSE are treated as 1 and 0 when summed. So with cumsum(), Value is incremented whenever the inner expression is TRUE, which is whenever Group changes.
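
The same three steps can be sketched in base R without dplyr. This is an illustration rather than part of the original answer: x stands in for df$Group, the names changed, starts and value are made up, and the sketch assumes x is non-empty and contains no NA.

```r
x <- c(48, 48, 48, 56, 56, 48, 48, 14)  # stand-in for df$Group

# Step 1: TRUE where an element differs from the one before it
# (compares elements 2..n against elements 1..n-1)
changed <- x[-1] != x[-length(x)]

# Step 2: the first element always starts a new run
starts <- c(TRUE, changed)

# Step 3: summing the logical vector counts the runs seen so far
value <- cumsum(starts)
```

Because every step is vectorised, this avoids the row-by-row data frame assignment that makes the loop in the question slow.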
mirirai

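How about the following? Note that it numbers each distinct value rather than each run, so the second block of 48 gets 1 again rather than 3: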

library(tidyverse)
df = data.frame(Group = c(48, 
                      48,
                      48,
                      56,
                      56,
                      48,
                      48,
                      14))

# Get unique values in group
unique_vals = unique(df$Group)

# create a sequence from 1 up until the length of the unique values vector
sequential_nums = 1:length(unique_vals)

# Create a new column looking up the current value in the unique_vals list
# and replacing it with the correct sequential number
df %>% 
  mutate(Value = sequential_nums[match(Group, unique_vals)])

#   Group Value
# 1    48     1
# 2    48     1
# 3    48     1
# 4    56     2
# 5    56     2
# 6    48     1
# 7    48     1
# 8    14     3
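
For comparison, here is a small base R sketch that puts the two numbering schemes side by side. The names x, by_value, by_run and comparison are hypothetical, and the run-based column assumes a non-empty vector with no NA.

```r
x <- c(48, 48, 48, 56, 56, 48, 48, 14)  # stand-in for df$Group

# One id per distinct value, in order of first appearance
by_value <- match(x, unique(x))

# One id per run of identical neighbours
by_run <- cumsum(c(TRUE, x[-1] != x[-length(x)]))

comparison <- data.frame(Group = x, by_value, by_run)
comparison
```

The two columns agree until a value reappears after a gap, which is the case the question asks about.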
lsfischer

If you use the tidyverse, dplyr 1.1.0 and later provide the function consecutive_id(), which does exactly what you want. The tidyverse team recommends it for Zoom call transcripts, where consecutive lines attributed to the same speaker should be grouped together as a single idea: https://www.tidyverse.org/blog/2023/02/dplyr-1-1-0-vctrs/#consecutive_id

library(dplyr)
df <- read.table(textConnection(
  'Group 
  48 
  48
  48
  56
  56
  48
  48
  14'), header = TRUE)

df |> mutate(value = consecutive_id(Group))
#>   Group value
#> 1    48     1
#> 2    48     1
#> 3    48     1
#> 4    56     2
#> 5    56     2
#> 6    48     3
#> 7    48     3
#> 8    14     4
Silverfish