I have a dataframe with records of trawl stations in different regions. I need to create a sequential index that changes every time the region changes. I've implemented it with a for loop, but I have about 60000 records, so it's super slow. Any idea on how to do it faster? Please note that I cannot simply group by region, because I need to keep the region in the order they were sampled.
Here is my solution:
# Create dataframe
df <- data.frame(region=c(rep("A",3),rep("B",8),rep("C",2),rep("A",7),rep("C",3)),date=seq.Date(from=as.Date("2020-03-20"),to=as.Date("2020-05-02"),length.out=23))
# create index column
df$region_id <- 1
# loop through each row to check if different from previous row. If different the id changes.
for(ii in 2:nrow(df)){
if(df$region[ii]!=df$region[ii-1]) {
df$region_id[ii] <- df$region_id[ii-1]+1
} else {
df$region_id[ii] <- df$region_id[ii-1]
}
}
And I get something like this:
region date region_id
1 A 2020-03-20 1
2 A 2020-03-21 1
3 A 2020-03-23 1
4 B 2020-03-25 2
5 B 2020-03-27 2
6 B 2020-03-29 2
7 B 2020-03-31 2
8 B 2020-04-02 2
9 B 2020-04-04 2
10 B 2020-04-06 2
11 B 2020-04-08 2
12 C 2020-04-10 3
13 C 2020-04-12 3
14 A 2020-04-14 4
15 A 2020-04-16 4
16 A 2020-04-18 4
17 A 2020-04-20 4
18 A 2020-04-22 4
19 A 2020-04-24 4
20 A 2020-04-26 4
21 C 2020-04-28 5
22 C 2020-04-30 5
23 C 2020-05-02 5
For 56373 records this takes:
user system elapsed
36.70 0.20 36.91
Any help will be much appreciated.
Thanks