I am trying to figure out what is the best way to loop through a data.frame, myData
, grouping by two columns, c1
and c2
.
Specifically I want to loop through each unique combination of c1
and c2
and apply a certain customFunction
to other columns in myData
. This customFunction
depends on someStatsFunction
, which outputs a data.frame
.
I would typically use the function plyr::ddply
, but my real dataset has over 18 million rows, and not surprisingly this is taking too long. So I decided to change the approach to a pipeline using dplyr::group_by
and dplyr::do
. Although using dplyr
speeds up the problem (see minimal example below), it still takes quite some time. I heard that the data.table
framework can speed things a lot (see example here), but I have no idea how to use it. I was wondering if anyone would be able to translate my problem below using data.table
so I can also benchmark it.
library(plyr)
library(dplyr)
library(rbenchmark)
someStatsFunction <- function (x) {
data.frame(name = 'something', mean = mean(x), sd = sd(x), statx = sqrt(mean(abs(x)))/sd(x)^2)
}
customFunction <- function (data) {
if (!all(sort(data$time) == data$time)) {
stop('Column \'time\' is not ordered')
}
someStatsFunction(data$response)
}
myData <- data.frame(c1 = rep(rep(1:50, each = 30), 10), c2 = rep(rep(1:30, 50), 10), response = rnorm(30 * 50 * 10), time = 1:(30 * 50 * 10))
benchmark('testPlyr' = {
testPlyr <- plyr::ddply(myData, .(c1, c2), customFunction)
},
'testDplyr' = {
testDplyr <- myData %>% dplyr::group_by(c1,c2) %>% dplyr::do(customFunction(.))
},
replications = 3,
columns = c('test', 'replications', 'elapsed', 'relative', 'user.self', 'sys.self'))
Here's what I get for the output:
test replications elapsed relative user.self sys.self
2 testDplyr 3 7.416 1.00 7.368 0.060
1 testPlyr 3 8.378 1.13 8.364 0.012
Thanks,
D
UPDATE after @minem's answer
First, I did some fixing with my example above because the code was not correct.
Second, I expanded my minimal reproducible example above to better reflect (slightly) my situation. The someStatsFunction
may depend on multiple columns from the data.table, and crunch a bunch of numbers based on some non-trivial combination of stats derived from these multiple columns. I also increased the size of myData
(so the example below takes longer now if compared to the original one). Anyhow, I think I managed to replicate the output I would get from plyr
or dplyr
. It runs faster with data.table, which is really cool (see benchmarking below). However, the code seems a bit clumsy:
library(plyr)
library(dplyr)
library(data.table)
library(rbenchmark)
someStatsFunction <- function (y, x) {
x <- as.integer(x)
mod <- coef(summary(lm(y ~ x)))
data.frame(stats1 = 'something',
intercept = mod[1],
slope = mod[2],
meanx = mean(x),
statx = sqrt(mean(abs(x)))/sd(y)^2)
}
customFunction <- function (data) {
if (!all(sort(data$time) == data$time)) {
stop('Column \'time\' is not ordered')
}
someStatsFunction(y = data$response, x = data$time)
}
myData <- data.frame(c1 = rep(rep(1:50, each = 30), 1095), c2 = rep(rep(1:30, 50), 1095), response = rnorm(30 * 50 * 1095), time = rep(seq(as.Date('1981-01-01'), as.Date('1983-12-31'), by = '1 day'), each = 50*30))
benchmark('testPlyr' = {
testPlyr <- plyr::ddply(myData, .(c1, c2), customFunction)
},
'testDplyr' = {
testDplyr <- myData %>% dplyr::group_by(c1,c2) %>% dplyr::do(customFunction(.))
},
'testDtb' = {
vNames <- c('stats1', 'intercept', 'slope', 'meanx', 'statx')
dt <- as.data.table(myData)
testDtb <- dt[order(time)][,
(vNames) := as.list(someStatsFunction(response, time)),
by = .(c1, c2)][,
head(.SD, 1), by = .(c1, c2)][,
c('response', 'time') := NULL, ]
},
replications = 3,
columns = c('test', 'replications', 'elapsed', 'relative', 'user.self', 'sys.self'))
Here's what I get for the output:
test replications elapsed relative user.self sys.self
2 testDplyr 3 28.209 3.101 20.841 7.317
3 testDtb 3 9.098 1.000 10.958 0.385
1 testPlyr 3 28.224 3.102 21.741 7.167
So substantial improvement in speed. However, I had to first order the data before applying someStatsFunction
(i.e. eliminating the need for an if
statement at customFunction
), to then run the function using the columns response
and time
in myData
. Moreover, the raw output from
dt[order(time)][, (vNames) := as.list(someStatsFunction(response, time)), by = .(c1, c2)]
gives a table that does not return 1500 values (i.e. 30*50 combinations of c1
and c2
), but instead repeats combinations of c1
and c2
multiple times. Also, it does return the original response
and time
columns, though I only wanted the unique combinations of c1
and c2
bound to the stats from someStatsFunction
(as in the output using plyr
and/or dplyr
), hence my final code
testDtb <- dt[order(time)][,
(vNames) := as.list(someStatsFunction(response, time)),
by = .(c1, c2)][,
head(.SD, 1), by = .(c1, c2)][,
c('response', 'time') := NULL, ]
Is there anyway I can achieve the same output but in a much more simplified way?