Group by sequential data in R

Question

I have the following data frame in R:

gene_name           gene_number
ENSMUSG00000000001  4732
ENSMUSG00000000001  4733
ENSMUSG00000000058  7603
ENSMUSG00000000058  7604
ENSMUSG00000000058  8246
ENSMUSG00000000058  8248
ENSMUSG00000000058  9001

The data is grouped by the gene_name column, and gene_number is sorted by other parameters (not relevant to the question). Inside each gene_name group, I want to sub-group the rows by gene_number: consecutive rows belong to the same sub-group as long as the difference between them is at most 2, and a new sub-group starts whenever the difference is larger. If a value ends up alone in its sub-group, I would like to remove it.

I want to have a new column specifying the new groups.

For example, in the data above:

ENSMUSG00000000001  4732  1
ENSMUSG00000000001  4733  1
ENSMUSG00000000058  7603  2
ENSMUSG00000000058  7604  2 
ENSMUSG00000000058  8246  3
ENSMUSG00000000058  8248  3

Thank you!

score 1 · Accepted Answer · answered Jun 23 '21 at 14:17

Here is one dplyr option -

library(dplyr)

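df %>%
  group_by(gene_name) %>%
  # TRUE on the first row of each gene and wherever the gap to the previous row exceeds 2
  mutate(grp = row_number() == 1L | gene_number - lag(gene_number) > 2) %>%
  # the running count of TRUEs gives one id per run of close values
  group_by(grp = cumsum(grp)) %>%
  # drop runs that contain a single row
  filter(n() > 1) %>%
  ungroup()
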
#  gene_name          gene_number   grp
#  <chr>                    <int> <int>
#1 ENSMUSG00000000001        4732     1
#2 ENSMUSG00000000001        4733     1
#3 ENSMUSG00000000058        7603     2
#4 ENSMUSG00000000058        7604     2
#5 ENSMUSG00000000058        8246     3
#6 ENSMUSG00000000058        8248     3

For each gene_name, subtract the previous gene_number from the current one and start a new group whenever the difference is greater than 2. Then drop any group that contains only a single row.
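
To make the intermediate steps visible, here is a minimal base R sketch of the same idea (not the code from the answer itself). It assumes, as stated in the question, that rows of the same gene are contiguous; the names new_gene, big_gap, starts_run, run_id, run_size and result are only illustrative.

```r
# Toy data copied from the question.
df <- data.frame(
  gene_name = c(rep("ENSMUSG00000000001", 2), rep("ENSMUSG00000000058", 5)),
  gene_number = c(4732L, 4733L, 7603L, 7604L, 8246L, 8248L, 9001L)
)

# A run starts on the first row of a gene (assumes rows of a gene are contiguous) ...
new_gene <- c(TRUE, df$gene_name[-1] != df$gene_name[-nrow(df)])
# ... or wherever the value is more than 2 above the previous row.
big_gap <- c(FALSE, diff(df$gene_number) > 2)
starts_run <- new_gene | big_gap

# Counting the run starts cumulatively gives one id per run.
run_id <- cumsum(starts_run)

# Size of the run each row belongs to; runs of size 1 are dropped.
run_size <- ave(run_id, run_id, FUN = length)
result <- df[run_size > 1, ]
result$grp <- run_id[run_size > 1]
print(result)
```

Here run_id is 1 1 2 2 3 3 4 before filtering, and the lone 9001 row (run 4) is the one that gets removed.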

data

df <- structure(list(gene_name = c("ENSMUSG00000000001", "ENSMUSG00000000001", 
"ENSMUSG00000000058", "ENSMUSG00000000058", "ENSMUSG00000000058", 
"ENSMUSG00000000058", "ENSMUSG00000000058"), gene_number = c(4732L, 
4733L, 7603L, 7604L, 8246L, 8248L, 9001L)), 
class = "data.frame", row.names = c(NA, -7L))

Thank you! This is exactly what I was looking for! – Rachel Rap, Jun 23 '21 at 16:09

score 1 · Answer 2 · answered Jun 23 '21 at 18:34

Using data.table

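library(data.table)

# flag the first row of each gene and every row whose gap to the previous row exceeds 2,
# turn the flags into run ids with cumsum(), then keep only runs with more than one row
setDT(df)[, grp := c(TRUE, diff(gene_number) > 2), gene_name][,
     grp := cumsum(grp)][, .SD[.N > 1], grp]

#    grp          gene_name gene_number
# 1:   1 ENSMUSG00000000001        4732
# 2:   1 ENSMUSG00000000001        4733
# 3:   2 ENSMUSG00000000058        7603
# 4:   2 ENSMUSG00000000058        7604
# 5:   3 ENSMUSG00000000058        8246
# 6:   3 ENSMUSG00000000058        8248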

data

df <- structure(list(gene_name = c("ENSMUSG00000000001", "ENSMUSG00000000001", 
"ENSMUSG00000000058", "ENSMUSG00000000058", "ENSMUSG00000000058", 
"ENSMUSG00000000058", "ENSMUSG00000000058"), gene_number = c(4732L, 
4733L, 7603L, 7604L, 8246L, 8248L, 9001L)), 
class = "data.frame", row.names = c(NA, -7L))
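
One thing to keep in mind with either approach: the group ids are assigned before single-row groups are dropped, so if a lone value sits between two kept groups, the remaining ids will have a gap. The sketch below uses hypothetical values (not from the question) for a single gene to show this, and one possible way to renumber the ids densely with match(); the names run_id, keep, grp and grp_dense are only illustrative.

```r
# Hypothetical values for a single gene; 50 has no close neighbour.
gene_number <- c(10L, 11L, 50L, 90L, 91L)

# Run ids assigned before filtering.
run_id <- cumsum(c(TRUE, diff(gene_number) > 2))

# Keep only rows whose run has more than one member.
keep <- ave(run_id, run_id, FUN = length) > 1
grp <- run_id[keep]

# Renumber the surviving ids as 1, 2, ... in order of appearance.
grp_dense <- match(grp, unique(grp))
print(data.frame(gene_number = gene_number[keep], grp, grp_dense))
```

Whether the gap matters depends on how the ids are used downstream; if they are only labels, the original numbering is fine as it is.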
