I am converting unstructured data into a long format and need to create an ID (grouping) variable. I want to assign an ID variable based on sets of values contained in another variable. More specifically, consider the following data set.
set.seed(1234); x.1 <- rep(letters[1:5], 10)
x.2 <- sample(c(0:10), 50, replace=TRUE)
x.3 <- rep(NA, 50); df <- data.frame(x.1, x.2, x.3)
df <- df[-c(2, 19),]
A unique case can be identified from the x.1 variable -- it starts with a
and ends with e
. This is always the case. x.3 will hold the ID (grouping) variable.
> head(df, 9)
x.1 x.2 x.3
a 1 NA
c 6 NA
d 6 NA
e 9 NA
a 7 NA
b 0 NA
c 2 NA
d 7 NA
e 5 NA
The number of records between a
and e
for a given case can vary considerably (in the real data file). Thus, I cannot assign a unique ID by simply dividing the variable by a fixed number of records. I figured out how to make the proper assignment by using a for loop:
START <- which(df$x.1== "a")
END <- which(df$x.1 == "e")
for(i in 1:length(START)){df$x.3[START[i]:END[i]] <- i}
head(df, 9)
x.1 x.2 x.3
a 1 1
c 6 1
d 6 1
e 9 1
a 7 2
b 0 2
c 2 2
d 7 2
e 5 2
The obvious problem with this approach is that it is much too slow for a data set with over one million records. It seems that lapply
could be an alternative, but I can't seem to figure out how to specify when a case ends and a new one begins as it traverses down through the data file. And, feel free to point me to an existing answer if one exists -- I didn't fine one!
Thanks in advance.