
I have a fairly large data frame, about 10 million rows. It has columns x and y, and what I want is to compute

hypot <- function(x) {sqrt(x[1]^2 + x[2]^2)}

for each row. Using apply would take a lot of time (about 5 minutes, extrapolating from smaller sizes) and a lot of memory.

That seems like too much to me, so I've tried a few different things:

  • compiling the hypot function reduces the time by about 10%
  • using functions from plyr greatly increases the running time.

What's the fastest way to do this?

Andrie
aplavin

3 Answers


What about with(my_data, sqrt(x^2 + y^2))?

set.seed(101)
d <- data.frame(x=runif(1e5),y=runif(1e5))

library(rbenchmark)

Two different per-line functions, one taking advantage of vectorization:

hypot <- function(x) sqrt(x[1]^2+x[2]^2)
hypot2 <- function(x) sqrt(sum(x^2))

Try compiling these too:

library(compiler)
chypot <- cmpfun(hypot)
chypot2 <- cmpfun(hypot2)

benchmark(sqrt(d[,1]^2+d[,2]^2),
          with(d,sqrt(x^2+y^2)),
          apply(d,1,hypot),
          apply(d,1,hypot2),
          apply(d,1,chypot),
          apply(d,1,chypot2),
          replications=50)

Results:

                       test replications elapsed relative user.self sys.self
5       apply(d, 1, chypot)           50  61.147  244.588    60.480    0.172
6      apply(d, 1, chypot2)           50  33.971  135.884    33.658    0.172
3        apply(d, 1, hypot)           50  63.920  255.680    63.308    0.364
4       apply(d, 1, hypot2)           50  36.657  146.628    36.218    0.260
1 sqrt(d[, 1]^2 + d[, 2]^2)           50   0.265    1.060     0.124    0.144
2  with(d, sqrt(x^2 + y^2))           50   0.250    1.000     0.100    0.144

As expected the with() solution and the column-indexing solution à la Tyler Rinker are essentially identical; hypot2 is twice as fast as the original hypot (but still about 150 times slower than the vectorized solutions). As already pointed out by the OP, compilation doesn't help very much.
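As a quick sanity check (my own sketch, not part of the benchmark above), the vectorized expression and the row-wise apply() version do produce the same values on a small example:

```r
# Verify that the vectorized with() expression agrees with the
# per-row apply() version on a toy data frame.
set.seed(101)
d <- data.frame(x = runif(10), y = runif(10))

hypot2 <- function(x) sqrt(sum(x^2))

vectorized <- with(d, sqrt(x^2 + y^2))
rowwise    <- apply(d, 1, hypot2)

all.equal(vectorized, rowwise)  # TRUE
```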

Ben Bolker
  • Thanks, that worked instantly! I'm new to R, still can't get used that all the operators are vectorized. – aplavin Dec 20 '12 at 19:27
  • Vectors are a beautiful thing :) – Ricardo Saporta Dec 20 '12 at 19:27
  • +1 for benchmarking! interesting that `[` is slower than `with`. `$` would get you even faster results! – Ricardo Saporta Dec 20 '12 at 19:30
  • @RicardoSaporta, I think that's just noise -- the timing difference is about 0.007 seconds ... – Ben Bolker Dec 20 '12 at 19:31
  • @BenBolker. I was curious so I ran this 100x 250reps: `with` and `$` were each faster ~45% of the time, `[` only about 10%. – Ricardo Saporta Dec 20 '12 at 19:46
  • If `m <- as.matrix(d)`, then `sqrt((m * m) %*% c(1, 1))` is competitive (probably ~1% faster, which means ~nothing). – Josh O'Brien Dec 20 '12 at 20:36
  • None of the benchmarking solutions assign the result! Are we trying to add a column to `d` or create a new vector in the parent environment? – mnel Dec 20 '12 at 22:25
  • @mnel: the OP doesn't seem to say (it just says "I want to compute ..."), so I guess it's left open to interpretation. – Ben Bolker Dec 20 '12 at 23:12
  • I hope they won't be computing to display the 10 million rows in the console window. Where they make the assignment will be important. – mnel Dec 20 '12 at 23:15
  • Hmm. Based on a quick trial `r <- numeric(1e8)` it takes about 4-8 seconds to allocate/assign a vector of the relevant size. If the example above scales linearly it will take about 250 seconds to do the computation; are we worrying prematurely about optimization? It's also not clear whether temporary memory utilization is a problem or not (`object.size` of the vector I made above is 760Mb). We've already gotten the OP a speedup of about 250x from their original solution ... it's not clear where the priority should be after that ... – Ben Bolker Dec 20 '12 at 23:25
  • I think the memory issues (conversion within apply) is important here too. – mnel Dec 21 '12 at 00:04
  • @chersanya: I laughed when I saw your first comment above, as after using R for a while now, I can't get used to the fact that other languages aren't vectorized. Every time I need to, I think to myself "really, I have to write this loop myself?" – Aaron left Stack Overflow Dec 21 '12 at 01:46

While Ben Bolker's answer is comprehensive, I will point out a few other reasons to avoid apply on data.frames.

apply will convert your data.frame to a matrix. This creates a copy (wasting time and memory), and may also cause unintended type conversions.
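To illustrate the type-conversion pitfall with a toy example of my own: if the data frame contains any non-numeric column, the as.matrix() call inside apply() coerces every value to character:

```r
# apply() first calls as.matrix(), so a single character column
# coerces the entire matrix -- numbers included -- to character.
df <- data.frame(x = 1:3, y = 4:6, label = c("a", "b", "c"))
m <- as.matrix(df)   # what apply() sees internally
class(m[1, "x"])     # "character" -- the numbers are now strings
```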

Given that you have 10 million rows of data, I would suggest you look at the data.table package that will let you do things efficiently in terms of memory and time.


For example, with tracemem(d) enabled:

x <- apply(d,1, hypot2)
tracemem[0x2f2f4410 -> 0x2f31b8b8]: as.matrix.data.frame as.matrix apply 

This is even worse if you then assign the result to a column of d:

d$x <- apply(d,1, hypot2)
tracemem[0x2f2f4410 -> 0x2ee71cb8]: as.matrix.data.frame as.matrix apply 
tracemem[0x2f2f4410 -> 0x2fa9c878]: 
tracemem[0x2fa9c878 -> 0x2fa9c3d8]: $<-.data.frame $<- 
tracemem[0x2fa9c3d8 -> 0x2fa9c1b8]: $<-.data.frame $<- 

4 copies! -- with 10 million rows, that will probably come back to bite you at some point.

If we use with, there is no copying involved if we assign to a new vector:

y <- with(d, sqrt(x^2 + y^2))

But there will be if we assign to a column of the data.frame d:

d$y <- with(d, sqrt(x^2 + y^2))
tracemem[0x2fa9c1b8 -> 0x2faa00d8]: 
tracemem[0x2faa00d8 -> 0x2faa0f48]: $<-.data.frame $<- 
tracemem[0x2faa0f48 -> 0x2faa0d08]: $<-.data.frame $<- 

Now, if you use data.table and := to assign by reference (no copying):

library(data.table)
DT <- data.table(d)
tracemem(DT)
[1] "<0x2d67a9a0>"
DT[,y := sqrt(x^2 + y^2)]

No copies!


Perhaps I will be corrected here, but another memory issue to consider is that sqrt(x^2 + y^2) will create 4 temporary variables (internally): x^2, y^2, x^2 + y^2, and then sqrt(x^2 + y^2).

The following will be slower, but only involves two variables being created:

 DT[, rowid := .I] # previous option: DT[, rowid := seq_len(nrow(DT))]
 DT[, y2 := sqrt(x^2 + y^2), by = rowid]
Jaap
mnel

R is vectorised, so you could use the following, plugging in your own matrix of course:

X <- t(matrix(1:4, 2, 2))^2
X
     [,1] [,2]
[1,]    1    4
[2,]    9   16

rowSums(X)^0.5

Nice and efficient :)
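Applied to the original x/y data frame, the same idea would look something like this (my own sketch; note that as.matrix() makes a copy, as pointed out in the other answer):

```r
set.seed(101)
d <- data.frame(x = runif(1e3), y = runif(1e3))

# rowSums() is itself vectorized, so this avoids apply() entirely,
# at the cost of one matrix copy from as.matrix().
r <- sqrt(rowSums(as.matrix(d)^2))

all.equal(r, with(d, sqrt(x^2 + y^2)))  # TRUE
```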

Róisín Grannell