Another way:
# assuming the data.frame is already sorted by
# all three columns (unfortunately, this is a requirement)
sequence(rle(do.call(paste, df))$lengths)
# [1] 1 1 1 2 3 1 1 1 1 2
Breakdown:
do.call(paste, df) # pastes each row of df into one string (default separator: a single space)
# [1] "A M 32" "A M 33" "A F 35" "A F 35" "A F 35" "A F 39" "B M 30" "B F 25" "B F 28"
# [10] "B F 28"
rle(.) # gets the run length vector
# Run Length Encoding
# lengths: int [1:7] 1 1 3 1 1 1 2
# values : chr [1:7] "A M 32" "A M 33" "A F 35" "A F 39" "B M 30" "B F 25" "B F 28"
$lengths # get the run-lengths (as opposed to values)
# [1] 1 1 3 1 1 1 2
sequence(.) # get 1:n for each n
# [1] 1 1 1 2 3 1 1 1 1 2
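The steps above can be run end to end on a small example. This is a minimal sketch: the data is reconstructed from the pasted rows shown in the breakdown, and the column names Place, Sex and Length are assumed for illustration.

```r
# Example data rebuilt from the pasted rows above.
# Column names are an assumption made for illustration.
df <- data.frame(
  Place  = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B"),
  Sex    = c("M", "M", "F", "F", "F", "F", "M", "F", "F", "F"),
  Length = c(32, 33, 35, 35, 35, 39, 30, 25, 28, 28)
)

key  <- do.call(paste, df)        # one string per row, e.g. "A F 35"
runs <- rle(key)                  # runs of consecutive identical rows
df$ID <- sequence(runs$lengths)   # 1..n within each run
df$ID
# [1] 1 1 1 2 3 1 1 1 1 2
```

Because rle only sees consecutive repeats, identical rows that are not adjacent would each restart at 1, which is why the data has to be ordered first.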
Benchmarking:
Since there are quite a few solutions, I thought I'd benchmark them on a relatively large data.frame. Here are the results (I also added a data.table solution).
Here's the data:
require(data.table)
require(plyr)
set.seed(45)
length <- 1e3 # number of rows in `df`
df <- data.frame(Place  = sample(letters[1:20], length, replace = TRUE),
                 Sex    = sample(c("M", "F"), length, replace = TRUE),
                 Length = sample(1:75, length, replace = TRUE))
df <- df[with(df, order(Place, Sex, Length)), ]
Ananda's ave solution:
AVE_FUN <- function(x) {
  i <- interaction(x)
  within(x, {
    ID <- ave(as.character(i), i, FUN = seq_along)
  })
}
Arun's rle solution:
RLE_FUN <- function(x) {
  # use the argument x, not the global df
  transform(x, ID = sequence(rle(do.call(paste, x))$lengths))
}
Ben's plyr solution:
PLYR_FUN <- function(x) {
  ddply(x, c("Place", "Sex", "Length"), transform, ID = seq_along(Length))
}
Finally, the data.table solution:
DT_FUN <- function(x) {
  dt <- data.table(x)
  dt[, ID := seq_along(.I), by = names(dt)]
}
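Before timing anything, it may be worth confirming that the approaches agree. The following is a minimal sketch in base R only, comparing the rle and ave approaches on data generated the same way as above; the names n, id_rle and id_ave are illustrative and not part of the original answers.

```r
set.seed(45)
n <- 1e3  # illustrative size, mirroring the smallest benchmark run
df <- data.frame(Place  = sample(letters[1:20], n, replace = TRUE),
                 Sex    = sample(c("M", "F"), n, replace = TRUE),
                 Length = sample(1:75, n, replace = TRUE))
# Ordering by all columns puts identical rows next to each other.
df <- df[with(df, order(Place, Sex, Length)), ]

# rle approach: count within runs of identical pasted rows
id_rle <- sequence(rle(do.call(paste, df))$lengths)

# ave approach: count within each combination of column values
# (returns character, so convert before comparing)
i <- interaction(df)
id_ave <- as.integer(ave(as.character(i), i, FUN = seq_along))

all(id_rle == id_ave)
```

The comparison only holds on ordered data; on unordered input the ave approach still counts per group, while the rle approach counts per consecutive run.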
Benchmarking code:
require(rbenchmark)
benchmark(d1 <- AVE_FUN(df),
          d2 <- RLE_FUN(df),
          d3 <- PLYR_FUN(df),
          d4 <- DT_FUN(df),
          replications = 5, order = "elapsed")
Results:
With length = 1e3 (number of rows in data.frame df):
#                 test replications elapsed relative user.self
# 2  d2 <- RLE_FUN(df)            5   0.013    1.000     0.013
# 4   d4 <- DT_FUN(df)            5   0.017    1.308     0.016
# 1  d1 <- AVE_FUN(df)            5   0.052    4.000     0.052
# 3 d3 <- PLYR_FUN(df)            5   4.629  356.077     4.452
With length = 1e4:
#                 test replications elapsed relative user.self
# 4   d4 <- DT_FUN(df)            5   0.033    1.000     0.031
# 2  d2 <- RLE_FUN(df)            5   0.089    2.697     0.088
# 1  d1 <- AVE_FUN(df)            5   0.102    3.091     0.100
# 3 d3 <- PLYR_FUN(df)            5  23.103  700.091    20.659
With length = 1e5:
#                 test replications elapsed relative user.self
# 4   d4 <- DT_FUN(df)            5   0.179    1.000     0.130
# 1  d1 <- AVE_FUN(df)            5   1.001    5.592     0.940
# 2  d2 <- RLE_FUN(df)            5   1.098    6.134     1.011
# 3 d3 <- PLYR_FUN(df)            5 219.861 1228.274   147.545
Observation: As the data grows, data.table (not surprisingly) performs best and scales really well, while ave and rle are close competitors for second place (ave scales better than rle). plyr performs quite badly at every size, unfortunately.
Note: Ananda's solution gives character output, and I kept it as such in the benchmarking.
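If an integer ID is preferred, one possible variant is to pass an integer vector to ave instead of a character one. This is only a sketch: AVE_INT_FUN is a hypothetical name, it is not Ananda's original code, and it was not part of the benchmark above.

```r
# Hypothetical variant of the ave approach that yields an integer ID.
AVE_INT_FUN <- function(x) {
  i <- interaction(x)
  # The first argument is an integer vector and seq_along() returns
  # integers, so the result stays integer.
  x$ID <- ave(seq_len(nrow(x)), i, FUN = seq_along)
  x
}

# Small illustrative input (values chosen for the example only)
small <- data.frame(Place  = c("A", "A", "B"),
                    Sex    = c("F", "F", "M"),
                    Length = c(35, 35, 30))
res <- AVE_INT_FUN(small)
is.integer(res$ID)
```

Alternatively, the character output of the original function can simply be wrapped in as.integer().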