I tend to be averse to using `apply` on rows of a data.frame (as any mis-step upconverts everything to character). I've had to do something very similar to what you are asking in other code, and I opted for `mapply`.

It does "something" with the first element of two (or more) vectors/lists, then does the same "something" with the second element of the same vectors/lists, and so on. The "something" is defined by the first argument -- a function, just as with the other `*apply` functions.
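For instance, a toy call:

## pairs the 1st elements together, then the 2nd, then the 3rd
mapply(function(a, b) a + b, 1:3, c(10, 20, 30))
## [1] 11 22 33

Applied to subsetting a data.frame by index ranges: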
set.seed(42)
x1 <- c(1,10,30)
x2 <- c(11,31,40)
df <- as.data.frame(sample(40))
ret <- mapply(function(a,b) df[a:b,], x1, x2)
ret
## [[1]]
## [1] 37 40 11 31 24 19 26 5 22 32 14
## [[2]]
## [1] 32 14 21 27 7 13 36 25 3 38 12 35 23 18 17 2 8 6 29 30 10 15
## [[3]]
## [1] 10 15 39 4 33 1 28 34 9 16 20
From here it would be trivial to apply any other statistical summaries you want:
sapply(ret, function(x) c(mean=mean(x), sd=sd(x)))
## [,1] [,2] [,3]
## mean 23.72727 19.13636 19.00000
## sd 10.95528 11.14107 12.87633
(Or you could always extend the `mapply` call to compute these other summaries directly.)
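For example, a minimal sketch of that variant, reusing `df`, `x1`, and `x2` from above:

## compute the summaries inside the anonymous function instead of afterwards
mapply(function(a, b) {
  x <- df[a:b, ]                     # each chunk is a numeric vector here
  c(mean = mean(x), sd = sd(x))
}, x1, x2)

which returns the same mean/sd matrix as the `sapply` call above.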
EDIT #1:
As suggested by @docendo discimus, `Map` (and `mapply` with `SIMPLIFY=FALSE`) are slightly faster. For comparison:
set.seed(3)
x1 <- c(1,11,31)
x2 <- c(10,30,40)
df1 <- data.frame(V1 = sample(40))
df2 <- df1[,,drop = FALSE]
df3 <- df1[,,drop = FALSE]
grp <- rep(seq_along(x1), (x2-x1) + 1L)
df2 <- cbind(df2, grp)
library(data.table)
library(dplyr)
library(microbenchmark)
microbenchmark(dt=setDT(df1)[, list(mean(V1), sd(V1), var(V1)), by = grp],
dplyr=df2 %>% group_by(grp) %>% summarise_each(funs(mean, sd, var)),
mapplyT=mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=TRUE),
mapplyF=mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=FALSE),
Map=Map(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2))
## Unit: microseconds
## expr min lq mean median uq max neval
## dt 925.964 1006.9570 1176.5629 1081.4810 1184.7870 2582.434 100
## dplyr 1843.449 1967.0590 2154.9829 2042.2515 2185.2745 3839.960 100
## mapplyT 208.398 237.8500 272.8850 260.8315 286.2685 511.846 100
## mapplyF 187.424 208.6205 237.6805 225.1320 247.2215 445.801 100
## Map 191.441 215.7610 240.9025 231.6025 258.6005 441.785 100
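(Side note: `Map` is just a thin wrapper around `mapply` with `SIMPLIFY = FALSE`, so the two non-simplifying variants above return identical results -- a quick sanity check:)

f <- function(a, b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)) }
identical(Map(f, x1, x2), mapply(f, x1, x2, SIMPLIFY = FALSE))
## should be TRUE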
I made explicit deep copies of the data.frame because `setDT` modifies the data.frame in place (ergo its efficiency), but `mapply` and `Map` were not able to cope with the converted data.table: single-bracket row subsetting on a data.table returns a data.table rather than a vector, so the summary functions would choke. (I baked `mean`, `sd`, and `var` into the `mapply` and `Map` calls in order to compare apples with apples.)
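To see the in-place conversion at work (assuming the benchmark above has been run at least once):

## setDT() converts by reference, so df1 itself is now a data.table
class(df1)
## [1] "data.table" "data.frame"
## the subsetted copies are untouched and stay plain data.frames
class(df3)
## [1] "data.frame"

`data.table::copy(df1)` would be an equally valid way to take the deep copy.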
EDIT #2:
The previous benchmarks look impressive and conclusive, but they measure per-call overhead more than the efficiency of the large-data engines. Here's another run at things with more data.

When the individual subsets are fairly large -- i.e., fewer "chunks" from the source data.frame -- performance tends to balance out. Here I control the chunk size with `k`:
n <- 4000
k <- 100
x1 <- c(1, sort(sample(n, size = n/k - 1)))
x2 <- c(x1[-1] - 1, n)
df1 <- data.frame(V1 = sample(n))
df2 <- df1[,,drop = FALSE]
df3 <- df1[,,drop = FALSE]
grp <- rep(seq_along(x1), (x2-x1) + 1L)
df2 <- cbind(df2, grp)
microbenchmark(dt=setDT(df1)[, list(mean(V1), sd(V1), var(V1)), by = grp],
dplyr=df2 %>% group_by(grp) %>% summarise_each(funs(mean, sd, var)),
mapplyT=mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=TRUE),
mapplyF=mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=FALSE),
Map=Map(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2))
## Unit: milliseconds
## expr min lq mean median uq max neval
## dt 2.133063 2.297282 2.549046 2.435618 2.655842 4.305396 100
## dplyr 2.145558 2.401482 2.643981 2.552090 2.720102 4.374118 100
## mapplyT 2.599392 2.775883 3.135473 2.926045 3.156978 5.430832 100
## mapplyF 2.498540 2.738398 3.079050 2.882535 3.094057 7.041340 100
## Map 2.624382 2.725680 3.158272 2.894808 3.184869 6.533956 100
However, if the chunk size is reduced (so there are many more groups to iterate over), the already-well-performing `dplyr` comes out ahead by a good margin:
n <- 4000
k <- 10
x1 <- c(1, sort(sample(n, size = n/k - 1)))
x2 <- c(x1[-1] - 1, n)
df1 <- data.frame(V1 = sample(n))
df2 <- df1[,,drop = FALSE]
df3 <- df1[,,drop = FALSE]
grp <- rep(seq_along(x1), (x2-x1) + 1L)
df2 <- cbind(df2, grp)
microbenchmark(dt=setDT(df1)[, list(mean(V1), sd(V1), var(V1)), by = grp],
dplyr=df2 %>% group_by(grp) %>% summarise_each(funs(mean, sd, var)),
mapplyT=mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=TRUE),
mapplyF=mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=FALSE),
Map=Map(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2))
## Unit: milliseconds
## expr min lq mean median uq max neval
## dt 11.494443 12.45187 14.163123 13.716532 14.655883 62.424668 100
## dplyr 2.729696 3.05501 3.286876 3.148276 3.324098 4.832414 100
## mapplyT 25.195579 27.67426 28.488846 28.319758 29.247729 32.897811 100
## mapplyF 25.455742 27.42816 28.713237 28.038622 28.958785 76.587224 100
## Map 25.184870 27.32730 28.737281 28.198155 28.768237 77.830470 100
Notice that `dplyr` took roughly the same time on the small dataset as on the larger one. Nice.
There are three kinds of lies: lies, damned lies, and statistics. (Benjamin Disraeli) This applies equally well to benchmarks.