Optimize loop for index creation in R

Question

I have a dataframe with records of trawl stations in different regions. I need to create a sequential index that changes every time the region changes. I've implemented it with a for loop, but I have about 60000 records, so it's super slow. Any idea on how to do it faster? Please note that I cannot simply group by region, because I need to keep the region in the order they were sampled.

Here is my solution:

# Create dataframe
df <- data.frame(region=c(rep("A",3),rep("B",8),rep("C",2),rep("A",7),rep("C",3)),date=seq.Date(from=as.Date("2020-03-20"),to=as.Date("2020-05-02"),length.out=23))

# create index column
df$region_id <- 1

# loop through each row to check if different from previous row. If different the id changes.
for(ii in 2:nrow(df)){
  if(df$region[ii]!=df$region[ii-1]) {
    df$region_id[ii] <- df$region_id[ii-1]+1
  } else {
    df$region_id[ii] <- df$region_id[ii-1]
  }
}

And I get something like this:

   region       date region_id
1       A 2020-03-20         1
2       A 2020-03-21         1
3       A 2020-03-23         1
4       B 2020-03-25         2
5       B 2020-03-27         2
6       B 2020-03-29         2
7       B 2020-03-31         2
8       B 2020-04-02         2
9       B 2020-04-04         2
10      B 2020-04-06         2
11      B 2020-04-08         2
12      C 2020-04-10         3
13      C 2020-04-12         3
14      A 2020-04-14         4
15      A 2020-04-16         4
16      A 2020-04-18         4
17      A 2020-04-20         4
18      A 2020-04-22         4
19      A 2020-04-24         4
20      A 2020-04-26         4
21      C 2020-04-28         5
22      C 2020-04-30         5
23      C 2020-05-02         5

For 56373 records this takes:

  user  system elapsed 
  36.70    0.20   36.91

Any help will be much appreciated.

Thanks

benson23 · Answer 1 · 2023-03-15T08:29:43.847

You can try using the dplyr package for this to avoid a for loop.

Use consecutive_id from dplyr version >= 1.1.0 (or data.table::rleid() if your dplyr version is not up-to-date)

df %>% mutate(region_id = consecutive_id(region))

Or first create a tmp column indicating whether the previous region is the same. Then use cumsum to sum this logical column.

library(dplyr)

df %>% 
  mutate(tmp = region != lag(region),
         region_id = 1 + cumsum(ifelse(is.na(tmp), F, tmp))) %>% 
  select(-tmp)

Output

   region       date region_id
1       A 2020-03-20         1
2       A 2020-03-21         1
3       A 2020-03-23         1
4       B 2020-03-25         2
5       B 2020-03-27         2
6       B 2020-03-29         2
7       B 2020-03-31         2
8       B 2020-04-02         2
9       B 2020-04-04         2
10      B 2020-04-06         2
11      B 2020-04-08         2
12      C 2020-04-10         3
13      C 2020-04-12         3
14      A 2020-04-14         4
15      A 2020-04-16         4
16      A 2020-04-18         4
17      A 2020-04-20         4
18      A 2020-04-22         4
19      A 2020-04-24         4
20      A 2020-04-26         4
21      C 2020-04-28         5
22      C 2020-04-30         5
23      C 2020-05-02         5

Thanks! I did not know consecutive_id either!! The rle solution with base R is 1 second faster, that's why I accepted that one. :) — Piera, Mar 15 '23 at 08:37

Rui Barradas · Accepted Answer · 2023-03-15T08:55:09.267

Here is a base R way with rle.

df <- data.frame(region=c(rep("A",3),rep("B",8),rep("C",2),rep("A",7),rep("C",3)),
                 date=seq.Date(from=as.Date("2020-03-20"),to=as.Date("2020-05-02"),length.out=23))

r <- rle(df$region)
r$values <- seq_along(r$values)
inverse.rle(r)
#>  [1] 1 1 1 2 2 2 2 2 2 2 2 3 3 4 4 4 4 4 4 4 5 5 5

df$region_id <- inverse.rle(r)
df
#>    region       date region_id
#> 1       A 2020-03-20         1
#> 2       A 2020-03-21         1
#> 3       A 2020-03-23         1
#> 4       B 2020-03-25         2
#> 5       B 2020-03-27         2
#> 6       B 2020-03-29         2
#> 7       B 2020-03-31         2
#> 8       B 2020-04-02         2
#> 9       B 2020-04-04         2
#> 10      B 2020-04-06         2
#> 11      B 2020-04-08         2
#> 12      C 2020-04-10         3
#> 13      C 2020-04-12         3
#> 14      A 2020-04-14         4
#> 15      A 2020-04-16         4
#> 16      A 2020-04-18         4
#> 17      A 2020-04-20         4
#> 18      A 2020-04-22         4
#> 19      A 2020-04-24         4
#> 20      A 2020-04-26         4
#> 21      C 2020-04-28         5
#> 22      C 2020-04-30         5
#> 23      C 2020-05-02         5
က
#> Error in eval(expr, envir, enclos): object 'က' not found

^{Created on 2023-03-15 with reprex v2.0.2}

Edit

With package data.table there is also rleid. This will advantageous with bigger data.frames.

library(data.table)

dt1 <- as.data.table(df)
dt1[, region_id := rleid(region)]
dt1
#>     region       date region_id
#>  1:      A 2020-03-20         1
#>  2:      A 2020-03-21         1
#>  3:      A 2020-03-23         1
#>  4:      B 2020-03-25         2
#>  5:      B 2020-03-27         2
#>  6:      B 2020-03-29         2
#>  7:      B 2020-03-31         2
#>  8:      B 2020-04-02         2
#>  9:      B 2020-04-04         2
#> 10:      B 2020-04-06         2
#> 11:      B 2020-04-08         2
#> 12:      C 2020-04-10         3
#> 13:      C 2020-04-12         3
#> 14:      A 2020-04-14         4
#> 15:      A 2020-04-16         4
#> 16:      A 2020-04-18         4
#> 17:      A 2020-04-20         4
#> 18:      A 2020-04-22         4
#> 19:      A 2020-04-24         4
#> 20:      A 2020-04-26         4
#> 21:      C 2020-04-28         5
#> 22:      C 2020-04-30         5
#> 23:      C 2020-05-02         5
#>     region       date region_id

^{Created on 2023-03-15 with reprex v2.0.2}

George Savva · Answer 3 · 2023-03-15T11:19:07.887

You could use base R and cumsum. This is effectively the same as the existing answer using dplyr::lag but with base only.

> df$region_id <- cumsum(df$region != c(FALSE,df$region[-length(df$region)]))
> df
   region       date region_id
1       A 2020-03-20         1
2       A 2020-03-21         1
3       A 2020-03-23         1
4       B 2020-03-25         2
5       B 2020-03-27         2
6       B 2020-03-29         2
7       B 2020-03-31         2
8       B 2020-04-02         2
9       B 2020-04-04         2
10      B 2020-04-06         2
11      B 2020-04-08         2
12      C 2020-04-10         3
13      C 2020-04-12         3
14      A 2020-04-14         4
15      A 2020-04-16         4
16      A 2020-04-18         4
17      A 2020-04-20         4
18      A 2020-04-22         4
19      A 2020-04-24         4
20      A 2020-04-26         4
21      C 2020-04-28         5
22      C 2020-04-30         5
23      C 2020-05-02         5

Maybe `c(FALSE, ...)` is a bit better, to avoid having a region name equals `0` (though unlikely). But I like your base R version! — benson23, Mar 15 '23 at 08:38

Optimize loop for index creation in R

3 Answers3

Output

Edit