Alternatives to a for loop with indexing - R

Question

I am converting unstructured data into a long format and need to create an ID (grouping) variable. I want to assign an ID variable based on sets of values contained in another variable. More specifically, consider the following data set.

set.seed(1234)
x.1 <- rep(letters[1:5], 10)
x.2 <- sample(c(0:10), 50, replace=TRUE)
x.3 <- rep(NA, 50)
df <- data.frame(x.1, x.2, x.3)
df <- df[-c(2, 19),]

A unique case can be identified from the x.1 variable -- it starts with a and ends with e. This is always the case. x.3 will hold the ID (grouping) variable.

> head(df, 9)
x.1 x.2 x.3
a   1    NA
c   6    NA
d   6    NA
e   9    NA
a   7    NA
b   0    NA
c   2    NA
d   7    NA
e   5    NA

The number of records between a and e for a given case can vary considerably (in the real data file). Thus, I cannot assign a unique ID by simply dividing the variable by a fixed number of records. I figured out how to make the proper assignment by using a for loop:

START <- which(df$x.1== "a")
END <- which(df$x.1 == "e")
for (i in seq_along(START)) {df$x.3[START[i]:END[i]] <- i}

head(df, 9)
x.1 x.2 x.3
a   1    1
c   6    1
d   6    1
e   9    1
a   7    2
b   0    2
c   2    2
d   7    2
e   5    2

The obvious problem with this approach is that it is much too slow for a data set with over one million records. It seems that lapply could be an alternative, but I can't figure out how to specify when a case ends and a new one begins as it traverses the data file. Feel free to point me to an existing answer if one exists; I didn't find one!

Thanks in advance.

talat · Accepted Answer · 2015-01-26T21:15:03.987

If there are no gaps between groups, i.e. each "e" is immediately followed by the "a" of the next group, you can simply use cumsum:

df$x.3 <- cumsum(df$x.1 == "a")
df
#   x.1 x.2 x.3
#1    a   1   1
#3    c   6   1
#4    d   6   1
#5    e   9   1
#6    a   7   2
#7    b   0   2
#8    c   2   2
#9    d   7   2
#10   e   5   2
#11   a   7   3
#12   b   5   3
#13   c   3   3
#...
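
If there are gaps, i.e. rows between an "e" and the next "a" that belong to no group, the same idea can probably be extended by also counting the "e"s. The following is only a sketch under the assumption that "a" and "e" strictly alternate; the vector `x` and the gap marker "x" are made up for illustration and are not part of the original data:

```r
# Hypothetical vector with stray rows ("x") between groups
x <- c("a", "b", "e", "x", "x", "a", "c", "d", "e", "x", "a", "e")

starts <- x == "a"
ends   <- x == "e"

# Number of groups opened so far (including the current row)
id <- cumsum(starts)

# Number of groups closed strictly before the current row
closed <- cumsum(ends) - ends

# A row lies inside a group while more groups have opened than closed
inside <- id > closed

# Rows in a gap get no ID
id[!inside] <- NA
id
# [1]  1  1  1 NA NA  2  2  2  2 NA  3  3
```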

If your data is very large, you could use data.table to update it by reference:

library(data.table)
setDT(df)[, x.3 := cumsum(x.1 == "a")]

As correctly noted by @nicola in the comments, this assumes that "a"s only appear at the beginnings of groups, not in the middle of them. Based on the sample data, this seems like a valid assumption.
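
To illustrate what happens when that assumption does not hold, here is a small sketch comparing the plain cumsum with the boundary-based variant suggested by @nicola in the comments. The vector `x` is made up for illustration, and `head()`/`tail()` are used here in place of the explicit indexing in the comment; the result should be the same:

```r
# Hypothetical vector in which "a" also occurs inside the first group
x <- c("a", "b", "a", "e", "a", "c", "e")

# Counting every "a" splits the first group in two
simple <- cumsum(x == "a")

# Count only an "a" that directly follows an "e";
# the first row always starts a group
boundary <- c(TRUE, head(x, -1) == "e" & tail(x, -1) == "a")
general <- cumsum(boundary)

simple
# [1] 1 1 2 2 3 3 3
general
# [1] 1 1 1 1 2 2 2
```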

How it works:

Let's take a subset of column "x.1":

x <- df$x.1[1:15]
x
# [1] a c d e a b c d e a b c d e a
#Levels: a b c d e

You can now check if x is equal to "a" which will create a logical vector:

x == "a"
# [1]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE

cumsum then computes a running total of the TRUE values (each of which counts as 1), so the total increases by one at every "a":

cumsum(x == "a")
# [1] 1 1 1 1 2 2 2 2 2 3 3 3 3 3 4

In other words, logical vectors can be used in arithmetic as if they were numeric vectors of 1s and 0s.
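
A few more examples of that coercion, as a minimal sketch on a made-up vector called `flags`:

```r
flags <- c(TRUE, FALSE, TRUE, TRUE)

as.integer(flags)  # explicit coercion
# [1] 1 0 1 1
sum(flags)         # number of TRUE values
# [1] 3
mean(flags)        # proportion of TRUE values
# [1] 0.75
cumsum(flags)      # running count of TRUE values
# [1] 1 1 2 3
```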

That is elegant. Could you describe how cumsum is actually doing this? It is exactly right, but I don't fully understand the logic. — Brian P, Jan 26 '15 at 21:04
+1. However this can fail if an "a" can repeat. In this case, a more general (although quite slower) solution could be `cumsum(c(TRUE,df$x.1[1:(nrow(df)-1)]=="e" & df$x.1[2:nrow(df)]=="a"))`, with the only condition that a case ends with an "e" and starts with an "a". — nicola, Jan 26 '15 at 21:07
