Update in light of OP's comment
If you are doing this on a million+ rows, all of the options supplied so far will be slow. Here are some comparison timings on a dummy data set of 100,000 rows:
set.seed(12)
DF3 <- data.frame(id = sample(1000, 100000, replace = TRUE),
                  group = factor(rep(1:100, each = 1000)),
                  value = runif(100000))
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))
> system.time(out1 <- do.call(rbind, lapply(split(DF3, DF3["group"]), `[`, 1, )))
   user  system elapsed
 19.594   0.053  19.984
> system.time(out3 <- aggregate(DF3[,-2], DF3["group"], function (x) x[1]))
   user  system elapsed
 12.419   0.141  12.788
I gave up doing them with a million rows. Far faster, believe it or not, is to build the result as a matrix and convert back to a data frame (unlist() coerces everything to a common type, hence the factor columns being recreated at the end):
out2 <- matrix(unlist(lapply(split(DF3[, -4], DF3["group"]), `[`, 1, )),
               byrow = TRUE, nrow = (lev <- length(levels(DF3$group))))
colnames(out2) <- names(DF3)[-4]
rownames(out2) <- seq_len(lev)
out2 <- as.data.frame(out2)
out2$group <- factor(out2$group)
out2$idu <- factor(paste(out2$id, out2$group, sep = "_"),
                   levels = levels(DF3$idu))
The outputs are (effectively) the same:
> all.equal(out1, out2)
[1] TRUE
> all.equal(out1, out3[, c(2,1,3,4)])
[1] "Attributes: < Component 2: Modes: character, numeric >"
[2] "Attributes: < Component 2: target is character, current is numeric >"
(The difference between `out1` (or `out2`) and `out3` (the `aggregate()` version) is just in the row names of the components.)
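A quick way to confirm that (my own check, not part of the timings above) is to copy the row names across and compare again; `all.equal()` should then report `TRUE` if the row names were the only difference:

rownames(out3) <- rownames(out1)   ## give out3 the character row names out1 has
all.equal(out1, out3[, c(2, 1, 3, 4)])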
The matrix version achieves all this with a timing of:
   user  system elapsed
  0.163   0.001   0.168
on the 100,000 row problem, and on this million row problem:
set.seed(12)
DF3 <- data.frame(id = sample(1000, 1000000, replace = TRUE),
                  group = factor(rep(1:1000, each = 1000)),
                  value = runif(1000000))
DF3 <- within(DF3, idu <- factor(paste(id, group, sep = "_")))
with a timing of:
   user  system elapsed
 11.916   0.000  11.925
Working with the matrix version (the one that produces `out2`) is quicker doing the million rows than the other versions are at doing the 100,000-row problem. This just shows that working with matrices is very quick indeed, and the bottleneck in my `do.call()` version is `rbind()`-ing the result together.
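As an aside (not one of the options timed above), if all you want is the first row for each group in order of first occurrence, a vectorised subset with `duplicated()` avoids the split/`rbind()` machinery altogether; a minimal sketch:

## !duplicated() is TRUE only for the first row of each group
out5 <- DF3[!duplicated(DF3$group), ]

The row names (and the row order, if the groups are interleaved) will differ from `out1`/`out2`, but the rows selected are the same.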
The million row problem timing was done with:
system.time({
    out4 <- matrix(unlist(lapply(split(DF3[, -4], DF3["group"]), `[`, 1, )),
                   byrow = TRUE,
                   nrow = (lev <- length(levels(DF3$group))))
    colnames(out4) <- names(DF3)[-4]
    rownames(out4) <- seq_len(lev)
    out4 <- as.data.frame(out4)
    out4$group <- factor(out4$group)
    out4$idu <- factor(paste(out4$id, out4$group, sep = "_"),
                       levels = levels(DF3$idu))
})
Original
If your data are in `DF`, say, then:
do.call(rbind, lapply(with(DF, split(DF, group)), head, 1))
will do what you want:
> do.call(rbind, lapply(with(DF, split(DF, group)), head, 1))
  idu group
1   1     1
2   4     2
3   7     3
If the new data are in `DF2`, then we get:
> do.call(rbind, lapply(with(DF2, split(DF2, group)), head, 1))
  id group idu value
1  1     1 1_1    34
2  4     2 4_2     6
3  1     3 1_3    34
But for speed, we probably want to subset instead of using `head()`, and we can gain a bit by not using `with()`, e.g.:
do.call(rbind, lapply(split(DF2, DF2$group), `[`, 1, ))
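In case the subsetting call looks opaque: lapply() hands each group's data frame to `[` as its first argument, and the trailing empty argument is the missing column index, so each call amounts to x[1, ]. A one-line illustration (mine, not from the timings):

identical(`[`(DF2, 1, ), DF2[1, ])   ## should be TRUE: `[`(x, 1, ) is x[1, ]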
> system.time(replicate(1000, do.call(rbind, lapply(split(DF2, DF2$group), `[`, 1, ))))
   user  system elapsed
  3.847   0.040   4.044
> system.time(replicate(1000, do.call(rbind, lapply(split(DF2, DF2$group), head, 1))))
   user  system elapsed
  4.058   0.038   4.111
> system.time(replicate(1000, aggregate(DF2[,-2], DF2["group"], function (x) x[1])))
   user  system elapsed
  3.902   0.042   4.106