1

I have a dataframe where one column indicates text lines, which are only partially consecutive:

df <- data.frame(
  line = c("0001","0002", "0003", "0011","0012","0234","0235","0236")
)

I want to group the rows based on consecutive line numbers to get this expected result:

df
  line grp
1 0001   1
2 0002   1
3 0003   1
4 0011   2
5 0012   2
6 0234   3
7 0235   3
8 0236   3

I've tried to approach this with dplyr's lag function but am stuck there:

library(dplyr)
df %>%
  mutate(line = as.numeric(line),
         diff = abs(lag(line) - line))
Chris Ruehlemann
  • 20,321
  • 4
  • 12
  • 34
  • The `cumsum(...diff(...) )` idiom to create a grouping variable is described here: [Create grouping variable for consecutive sequences and split vector](https://stackoverflow.com/questions/5222061/create-grouping-variable-for-consecutive-sequences-and-split-vector) – Henrik Mar 26 '21 at 10:07

2 Answers2

3

Convert the numbers to numeric, calculate difference between consecutive numbers and increment the group count when the difference is greater than 1.

transform(df, group = cumsum(c(TRUE, diff(as.numeric(line)) > 1)))

#  line group
#1 0001     1
#2 0002     1
#3 0003     1
#4 0011     2
#5 0012     2
#6 0234     3
#7 0235     3
#8 0236     3

If you want to use dplyr :

library(dplyr)
df %>% mutate(group = cumsum(c(TRUE, diff(as.numeric(line)) > 1)))
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
1

Does this work:

library(dplyr)
library(stringr)
library(data.table)
df %>% mutate(z = str_count(line, '0'), grp = rleid(z)) %>% select(-z)
  line grp
1 0001   1
2 0002   1
3 0003   1
4 0011   2
5 0012   2
6 0234   3
7 0235   3
8 0236   3
Karthik S
  • 11,348
  • 2
  • 11
  • 25
  • That's quite clever but only works if the groups are distinguished by the number of `0`in `line`, which is the case in the toy data but not in the actual data. – Chris Ruehlemann Mar 26 '21 at 10:16