I tend to be averse to using `apply` on rows of a data.frame (as any mis-step upconverts everything to character). I've had to do something very similar to what you are asking in other code, and I opted for `mapply`.

It does "something" with the first element of two (or more) vectors/lists, then does the same "something" with the second element of the same vectors/lists, and so on. The "something" is defined by the first argument -- a function, just as with the other `*apply` functions.
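For instance, a toy call:

## pairs the 1st elements together, then the 2nd, then the 3rd
mapply(function(a, b) a + b, 1:3, c(10, 20, 30))
## [1] 11 22 33

Applied to subsetting a data.frame by index ranges: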
set.seed(42)
x1 <- c(1,10,30)
x2 <- c(11,31,40)
df <- as.data.frame(sample(40))
ret <- mapply(function(a,b) df[a:b,], x1, x2)
ret
## [[1]]
## [1] 37 40 11 31 24 19 26 5 22 32 14
## [[2]]
## [1] 32 14 21 27 7 13 36 25 3 38 12 35 23 18 17 2 8 6 29 30 10 15
## [[3]]
## [1] 10 15 39 4 33 1 28 34 9 16 20
From here it would be trivial to apply any other statistical summaries you want:
sapply(ret, function(x) c(mean=mean(x), sd=sd(x)))
## [,1] [,2] [,3]
## mean 23.72727 19.13636 19.00000
## sd 10.95528 11.14107 12.87633
(Or you could always extend the `mapply` call to compute these other summaries directly.)
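For example, a minimal sketch of that variant, reusing `df`, `x1`, and `x2` from above:

## compute the summaries inside the anonymous function instead of afterwards
mapply(function(a, b) {
  x <- df[a:b, ]                     # each chunk is a numeric vector here
  c(mean = mean(x), sd = sd(x))
}, x1, x2)

which returns the same mean/sd matrix as the `sapply` call above.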
EDIT #1:
As suggested by @docendo discimus, `Map` (and `mapply` with `SIMPLIFY=FALSE`) are slightly faster. For comparison:
set.seed(3)
x1 <- c(1,11,31)
x2 <- c(10,30,40)
df1 <- data.frame(V1 = sample(40))
df2 <- df1[,,drop = FALSE]
df3 <- df1[,,drop = FALSE]
grp <- rep(seq_along(x1), (x2-x1) + 1L)
df2 <- cbind(df2, grp)
library(data.table)
library(dplyr)
library(microbenchmark)
microbenchmark(dt=setDT(df1)[, list(mean(V1), sd(V1), var(V1)), by = grp],
dplyr=df2 %>% group_by(grp) %>% summarise_each(funs(mean, sd, var)),
mapplyT=mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=TRUE),
mapplyF=mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=FALSE),
Map=Map(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2))
## Unit: microseconds
## expr min lq mean median uq max neval
## dt 925.964 1006.9570 1176.5629 1081.4810 1184.7870 2582.434 100
## dplyr 1843.449 1967.0590 2154.9829 2042.2515 2185.2745 3839.960 100
## mapplyT 208.398 237.8500 272.8850 260.8315 286.2685 511.846 100
## mapplyF 187.424 208.6205 237.6805 225.1320 247.2215 445.801 100
## Map 191.441 215.7610 240.9025 231.6025 258.6005 441.785 100
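(Side note: `Map` is just a thin wrapper around `mapply` with `SIMPLIFY = FALSE`, so the two non-simplifying variants above return identical results -- a quick sanity check:)

f <- function(a, b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)) }
identical(Map(f, x1, x2), mapply(f, x1, x2, SIMPLIFY = FALSE))
## should be TRUE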
I made explicit deep copies of the data.frame because `setDT` modifies the data.frame in place (ergo its efficiency), but `mapply` and `Map` were not able to cope with the converted data.table: single-bracket row subsetting on a data.table returns a data.table rather than a vector, so the summary functions would choke. (I baked `mean`, `sd`, and `var` into the `mapply` and `Map` calls in order to compare apples with apples.)
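To see the in-place conversion at work (assuming the benchmark above has been run at least once):

## setDT() converts by reference, so df1 itself is now a data.table
class(df1)
## [1] "data.table" "data.frame"
## the subsetted copies are untouched and stay plain data.frames
class(df3)
## [1] "data.frame"

`data.table::copy(df1)` would be an equally valid way to take the deep copy.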
EDIT #2:
The previous benchmarks look impressive and conclusive, but they measure per-call overhead more than the efficiency of the large-data engines. Here's another run at things with more data.

When the individual subsets are fairly large -- i.e., fewer "chunks" from the source data.frame -- performance tends to balance out. Here I control the chunk size with `k`:
n <- 4000
k <- 100
x1 <- c(1, sort(sample(n, size = n/k - 1)))
x2 <- c(x1[-1] - 1, n)
df1 <- data.frame(V1 = sample(n))
df2 <- df1[,,drop = FALSE]
df3 <- df1[,,drop = FALSE]
grp <- rep(seq_along(x1), (x2-x1) + 1L)
df2 <- cbind(df2, grp)
microbenchmark(dt=setDT(df1)[, list(mean(V1), sd(V1), var(V1)), by = grp],
dplyr=df2 %>% group_by(grp) %>% summarise_each(funs(mean, sd, var)),
mapplyT=mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=TRUE),
mapplyF=mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=FALSE),
Map=Map(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2))
## Unit: milliseconds
## expr min lq mean median uq max neval
## dt 2.133063 2.297282 2.549046 2.435618 2.655842 4.305396 100
## dplyr 2.145558 2.401482 2.643981 2.552090 2.720102 4.374118 100
## mapplyT 2.599392 2.775883 3.135473 2.926045 3.156978 5.430832 100
## mapplyF 2.498540 2.738398 3.079050 2.882535 3.094057 7.041340 100
## Map 2.624382 2.725680 3.158272 2.894808 3.184869 6.533956 100
However, if the chunk size is reduced (so there are many more groups to iterate over), the already-well-performing `dplyr` comes out ahead by a good margin:
n <- 4000
k <- 10
x1 <- c(1, sort(sample(n, size = n/k - 1)))
x2 <- c(x1[-1] - 1, n)
df1 <- data.frame(V1 = sample(n))
df2 <- df1[,,drop = FALSE]
df3 <- df1[,,drop = FALSE]
grp <- rep(seq_along(x1), (x2-x1) + 1L)
df2 <- cbind(df2, grp)
microbenchmark(dt=setDT(df1)[, list(mean(V1), sd(V1), var(V1)), by = grp],
dplyr=df2 %>% group_by(grp) %>% summarise_each(funs(mean, sd, var)),
mapplyT=mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=TRUE),
mapplyF=mapply(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2, SIMPLIFY=FALSE),
Map=Map(function(a,b) { x <- df3[a:b,]; c(mean(x), sd(x), var(x)); }, x1, x2))
## Unit: milliseconds
## expr min lq mean median uq max neval
## dt 11.494443 12.45187 14.163123 13.716532 14.655883 62.424668 100
## dplyr 2.729696 3.05501 3.286876 3.148276 3.324098 4.832414 100
## mapplyT 25.195579 27.67426 28.488846 28.319758 29.247729 32.897811 100
## mapplyF 25.455742 27.42816 28.713237 28.038622 28.958785 76.587224 100
## Map 25.184870 27.32730 28.737281 28.198155 28.768237 77.830470 100
Notice that `dplyr` took roughly the same time on the small dataset as on the larger one. Nice.
There are three kinds of lies: lies, damned lies, and statistics. (Benjamin Disraeli) This applies equally well to benchmarks.