I'm working with some big data.frames in R and want to know which of the following two options is faster.

df[which(condition), ] = value

or

df[condition, ] = value

Assume that most of the data doesn't satisfy the condition, so length(which(condition)) is much smaller than the length of the logical vector.

Is it more efficient to pass specific indices than to go through the whole data.frame/vector and select each row/element where the logical vector is TRUE at that position?

Or does calling another function (which) just add overhead?

I assumed someone else had already asked this, but I could not find an answer. This seems relevant, but the discussions I saw there only cover the case where you need the logical vector/indices again.
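
To make the two forms concrete, here is a minimal sketch on a toy data.frame. The names `df` and `condition`, the sizes, and the replacement value `0` are made up for illustration. With no NAs in the condition, both assignments should give the same result; the vector example at the end shows where they can differ, since `which()` drops NAs.

```r
# Toy data; names and sizes are illustrative assumptions.
set.seed(1)
df <- data.frame(a = runif(1000), b = runif(1000))
condition <- df$a > 0.99   # mostly FALSE, no NAs

# Form 1: logical index
df_logical <- df
df_logical[condition, ] <- 0

# Form 2: integer indices from which()
df_which <- df
df_which[which(condition), ] <- 0

same_result <- identical(df_logical, df_which)

# On a plain vector, an NA in the condition shows the difference:
x <- c(1, NA, 3)
n_logical <- length(x[x > 2])         # the NA is kept as an NA element
n_which   <- length(x[which(x > 2)])  # which() drops the NA
```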

T.G.
  • Why don't you just run a benchmark with some of your real data? Look at the `microbenchmark` package, for example. – talat Nov 30 '16 at 13:48
  • The [NEWS file](https://stat.ethz.ch/pipermail/r-announce/2016/000602.html) from R 3.3.0 says: "Thanks to a patch from Tomas Kalibera, x[x != 0] is now typically faster than x[which(x != 0)] (in the case where x has no NAs, the two are equivalent)." – Roland Nov 30 '16 at 14:08
  • @docendodiscimus, in my experience, checking like this is not very informative: sometimes I run exactly the same command and the timings differ a lot. I tried checking with `system.time()`; is `microbenchmark` more accurate? – T.G. Nov 30 '16 at 14:15
  • Thanks! @Roland do you have any idea if the fact that most of the boolean vector is FALSE has any effect? – T.G. Nov 30 '16 at 14:15
  • @T.G., yes, `microbenchmark` is more accurate because it runs the benchmark multiple times (100 times by default) and reports a summary. – talat Nov 30 '16 at 14:16
  • @docendodiscimus Thanks! I'll try it. – T.G. Nov 30 '16 at 14:17
  • I don't think the relative efficiency of the two alternatives makes a big difference here, since your main overhead comes from `"[<-.data.frame"`, which takes its "i" argument (the "condition" in your example) and uses it to subset the sequence of rows of the data.frame (search for `iseq <- seq_len(nrows)[i]` inside its body). So I guess the question is (1) what the difference is between subsetting an integer vector with a logical vector (of the same length) and with an integer vector of indices, and (2) how much that affects the overall `"[<-.data.frame"` call. – alexis_laz Nov 30 '16 at 14:59
