While Ben Bolker's answer is comprehensive, I will explain other reasons to avoid apply on data.frames.
apply will convert your data.frame to a matrix. This creates a copy (a waste of time and memory) and may also cause unintended type conversions.
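To make the type-conversion risk concrete, here is a minimal sketch. The data frame mixed and its columns are made up for illustration and are not from the question. As soon as one column is non-numeric, the matrix that apply builds is a character matrix, so the row function receives strings instead of numbers.

```r
# Hypothetical data frame: one numeric column, one character column
mixed <- data.frame(x = c(1.5, 2.5), g = c("a", "b"), stringsAsFactors = FALSE)

# apply() calls as.matrix() on a data.frame; mixed columns become character
m <- as.matrix(mixed)
is.character(m)

# Each row therefore reaches the function as a character vector
row_classes <- apply(mixed, 1, class)
row_classes
```

With purely numeric columns the conversion is harmless for the values, but the copy is still made.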
Given that you have 10 million rows of data, I would suggest you look at the data.table package, which will let you do things efficiently in terms of both memory and time.
For example, using tracemem on d:
x <- apply(d,1, hypot2)
tracemem[0x2f2f4410 -> 0x2f31b8b8]: as.matrix.data.frame as.matrix apply
This is even worse if you then assign to a column in d:
d$x <- apply(d,1, hypot2)
tracemem[0x2f2f4410 -> 0x2ee71cb8]: as.matrix.data.frame as.matrix apply
tracemem[0x2f2f4410 -> 0x2fa9c878]:
tracemem[0x2fa9c878 -> 0x2fa9c3d8]: $<-.data.frame $<-
tracemem[0x2fa9c3d8 -> 0x2fa9c1b8]: $<-.data.frame $<-
Four copies! With 10 million rows, that will probably come back to bite you at some point.
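The traces above rely on setup that is not shown in this answer. A minimal sketch for reproducing them might look like the following, assuming d has two numeric columns x and y and that hypot2 is a row-wise hypotenuse function. Both are my assumptions here, and the addresses printed on your machine will differ.

```r
# Assumed setup: two numeric columns and a row-wise hypotenuse function
set.seed(1)
d <- data.frame(x = runif(1e4), y = runif(1e4))
hypot2 <- function(v) sqrt(sum(v^2))   # hypothetical definition

# tracemem() needs an R build with memory profiling enabled
if (capabilities("profmem")) tracemem(d)

# Expect a trace line mentioning as.matrix.data.frame when d is traced
h <- apply(d, 1, hypot2)

if (capabilities("profmem")) untracemem(d)
```

The guard on capabilities("profmem") is only there so the sketch still runs on builds without memory profiling; in that case no trace lines are printed.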
If we use with, there is no copying involved if we assign to a vector:
y <- with(d, sqrt(x^2 + y^2))
But there will be if we assign to a column in the data.frame d:
d$y <- with(d, sqrt(x^2 + y^2))
tracemem[0x2fa9c1b8 -> 0x2faa00d8]:
tracemem[0x2faa00d8 -> 0x2faa0f48]: $<-.data.frame $<-
tracemem[0x2faa0f48 -> 0x2faa0d08]: $<-.data.frame $<-
Now, if you use data.table and := to assign by reference, there is no copying:
library(data.table)
DT <- data.table(d)
tracemem(DT)
[1] "<0x2d67a9a0>"
DT[,y := sqrt(x^2 + y^2)]
No copies!
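One related point: data.table(d) itself builds a new object from d. If that initial copy matters, data.table also provides setDT, which converts a data.frame in place. A small sketch on made-up values (the column name h is only for illustration):

```r
library(data.table)

# Made-up data so the results are easy to check by hand
d <- data.frame(x = c(3, 5), y = c(4, 12))

setDT(d)                     # converts d to a data.table by reference
d[, h := sqrt(x^2 + y^2)]    # adds column h by reference
d$h
```

After setDT, d is a data.table, so the := syntax works on it directly.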
Perhaps I will be corrected here, but another memory issue to consider is that sqrt(x^2 + y^2) will create 4 temporary variables internally: x^2, y^2, x^2 + y^2 and then sqrt(x^2 + y^2).
The following will be slower, but will involve only two variables being created:
DT[, rowid := .I] # previous option: DT[, rowid := seq_len(nrow(DT))]
DT[, y2 := sqrt(x^2 + y^2), by = rowid]
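As a sanity check, the row-by-row version should give the same values as the vectorised one. Here is a small sketch on made-up data; the column names h_vec and h_row are hypothetical and chosen so that neither result overwrites an input column.

```r
library(data.table)

# Made-up data with easily checked results
DT <- data.table(x = c(3, 5, 8), y = c(4, 12, 15))

DT[, h_vec := sqrt(x^2 + y^2)]               # one vectorised call
DT[, rowid := .I]                            # one group per row
DT[, h_row := sqrt(x^2 + y^2), by = rowid]   # evaluated row by row

all.equal(DT$h_vec, DT$h_row)
```

On real data, wrapping each of the two assignments in system.time() is a simple way to see how much speed the grouped version gives up.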