
I am trying to figure out the best way to loop through a data.frame, myData, grouping by two columns, c1 and c2. Specifically, I want to loop through each unique combination of c1 and c2 and apply a certain customFunction to other columns in myData. This customFunction depends on someStatsFunction, which outputs a data.frame.

I would typically use the function plyr::ddply, but my real dataset has over 18 million rows and, not surprisingly, this is taking too long. So I decided to change the approach to a pipeline using dplyr::group_by and dplyr::do. Although dplyr speeds things up (see minimal example below), it still takes quite some time. I heard that the data.table framework can speed things up a lot (see example here), but I have no idea how to use it. I was wondering if anyone would be able to translate my problem below into data.table so I can also benchmark it.

library(plyr)  
library(dplyr)  
library(rbenchmark)  

someStatsFunction  <-  function (x) {
    data.frame(name = 'something', mean = mean(x), sd = sd(x), statx = sqrt(mean(abs(x)))/sd(x)^2)
}

customFunction  <-  function (data) {
    if (!all(sort(data$time) == data$time)) {
        stop('Column \'time\' is not ordered')
    }
    someStatsFunction(data$response)
}

myData  <-  data.frame(c1 = rep(rep(1:50, each = 30), 10), c2 = rep(rep(1:30, 50), 10), response = rnorm(30 * 50 * 10), time = 1:(30 * 50 * 10))

benchmark('testPlyr' = {
            testPlyr   <-  plyr::ddply(myData, .(c1, c2), customFunction)
          },
          'testDplyr' = {
            testDplyr  <-  myData %>% dplyr::group_by(c1,c2) %>% dplyr::do(customFunction(.))
          },
          replications = 3,
          columns      = c('test', 'replications', 'elapsed', 'relative', 'user.self', 'sys.self'))

Here's what I get for the output:

       test replications elapsed relative user.self sys.self
2 testDplyr            3   7.416     1.00     7.368    0.060
1  testPlyr            3   8.378     1.13     8.364    0.012

Thanks,
D

UPDATE after @minem's answer

First, I fixed my example above because the original code was not correct.

Second, I expanded my minimal reproducible example above to better reflect my actual situation: someStatsFunction may depend on multiple columns of the data and crunch a bunch of numbers based on some non-trivial combination of stats derived from those columns. I also increased the size of myData, so the example below takes longer than the original one. Anyhow, I think I managed to replicate the output I would get from plyr or dplyr, and it runs faster with data.table, which is really cool (see benchmarking below). However, the code seems a bit clumsy:

library(plyr)  
library(dplyr)  
library(data.table)  
library(rbenchmark)  

someStatsFunction  <-  function (y, x) {
    x    <-  as.integer(x)
    mod  <-  coef(summary(lm(y ~ x)))
    data.frame(stats1  = 'something',
             intercept = mod[1],
             slope     = mod[2],
             meanx     = mean(x),
             statx     = sqrt(mean(abs(x)))/sd(y)^2)
}

customFunction  <-  function (data) {
    if (!all(sort(data$time) == data$time)) {
        stop('Column \'time\' is not ordered')
    }
    someStatsFunction(y = data$response, x = data$time)
}

myData  <-  data.frame(c1 = rep(rep(1:50, each = 30), 1095), c2 = rep(rep(1:30, 50), 1095), response = rnorm(30 * 50 * 1095), time = rep(seq(as.Date('1981-01-01'), as.Date('1983-12-31'), by = '1 day'), each = 50*30))

benchmark('testPlyr' = {
            testPlyr   <-  plyr::ddply(myData, .(c1, c2), customFunction)
        },
          'testDplyr' = {
            testDplyr  <-  myData %>% dplyr::group_by(c1,c2) %>% dplyr::do(customFunction(.))
        },
          'testDtb' = {
            vNames   <-  c('stats1', 'intercept', 'slope', 'meanx', 'statx')
            dt       <- as.data.table(myData)
            testDtb  <- dt[order(time)][, 
            (vNames) := as.list(someStatsFunction(response, time)), 
            by = .(c1, c2)][, 
            head(.SD, 1), by = .(c1, c2)][, 
            c('response', 'time') := NULL, ]
        },
    replications = 3,
    columns      = c('test', 'replications', 'elapsed', 'relative', 'user.self', 'sys.self'))

Here's what I get for the output:

       test replications elapsed relative user.self sys.self
2 testDplyr            3  28.209    3.101    20.841    7.317
3   testDtb            3   9.098    1.000    10.958    0.385
1  testPlyr            3  28.224    3.102    21.741    7.167

So there is a substantial improvement in speed. However, I had to order the data first and apply someStatsFunction directly to the response and time columns, thereby bypassing the check done by the if statement in customFunction. Moreover, the raw output from

dt[order(time)][, (vNames) := as.list(someStatsFunction(response, time)), by = .(c1, c2)]

gives a table that does not have 1,500 rows (i.e. the 30 * 50 unique combinations of c1 and c2), but instead repeats each combination of c1 and c2 many times, because `:=` adds columns to the existing table rather than aggregating it (illustrated just after the code below). It also keeps the original response and time columns, whereas I only wanted the unique combinations of c1 and c2 bound to the stats from someStatsFunction (as in the output from plyr and/or dplyr), hence my final code:

testDtb  <- dt[order(time)][, 
(vNames) := as.list(someStatsFunction(response, time)), 
by = .(c1, c2)][, 
head(.SD, 1), by = .(c1, c2)][, 
c('response', 'time') := NULL, ]
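
To see why this happens, here is a minimal sketch with toy data (not part of the benchmark above) contrasting `:=`, which adds columns by reference and keeps every row, with returning a list in `j`, which aggregates to one row per group:

library(data.table)

# toy data: two groups of three rows each
toy <- data.table(g = rep(1:2, each = 3), v = 1:6)

# `:=` adds a column by reference; the per-group mean is recycled across
# each group's rows, so the table still has all 6 rows
toy[, m := mean(v), by = g]

# returning a list (or data.frame) in j aggregates instead,
# giving one row per group
toy[, .(m = mean(v)), by = g]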

Is there any way I can achieve the same output in a much simpler way?

dbarneche
  • Your example data is too tiny for the timing to matter. Maybe write it as dependent on #rows and #groups. Also, I guess you can simplify it (no reason to have two grouping vars c1 and c2) – Frank Aug 09 '18 at 15:11
  • Thanks Frank, I used this simple example with c1 and c2 because in my original data they are coordinates (lat long). I basically need to apply a complex function to an environmental time series for each pair of coordinates. Does pasting the coordinates into one single grouping variable (for the purposes of the loop) and then splitting them back up in the output speed things up? Or would that just create additional unnecessary code? Plus, I'm interested in learning whether a `data.table` alternative would speed things up considerably... – dbarneche Aug 09 '18 at 20:24
  • I don't expect that combining them would improve speed; I just meant it's extra complexity that we might not need to address the question... I think (but could be wrong) that only #groups matters (calculable as `uniqueN(setDT(mydata), by=c("c1", "c2"))` after data.table is loaded). Btw, if you want to know which grouped calculations are optimized, you could look at `?GForce`. – Frank Aug 09 '18 at 20:30
  • Perfect, thanks for the advice. The data.table concept is still very new to me, so I still don't quite know how to translate the examples above into the data.table framework. But I'll try to edit my post with a data.table example if I get one to work. – dbarneche Aug 09 '18 at 22:13
    something like `setDT(myData)[order(time), someStatsFunction(response, time), by=.(c1, c2)]` will achieve what you want – chinsoon12 Aug 14 '18 at 10:09
  • Thanks so much **@chinsoon12**, your suggestion did the trick. I added a full comparison among methods below for future reference. – dbarneche Aug 17 '18 at 12:05
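
Following up on the `?GForce` pointer above, here is a minimal sketch (assuming the `myData` object defined earlier) of how to check whether data.table's internal GForce optimisation applies to a grouped call; with `verbose = TRUE`, data.table reports when `j` has been optimised, which only happens for a limited set of functions such as `mean()` and `sd()`, not for user-defined functions like `someStatsFunction`:

library(data.table)

dt <- as.data.table(myData)

# verbose = TRUE prints a message like 'GForce optimized j to ...' when the
# grouped expression can be handled by data.table's optimised internals
dt[, .(mean = mean(response), sd = sd(response)), by = .(c1, c2), verbose = TRUE]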

2 Answers


try:

dt <- as.data.table(myData)
rr <- dt[, .(
  lon = c1,
  lat = c2,
  name = 'something',
  mean = mean(response),
  sd = sd(response),
  statx = sqrt(abs(response)) / sd(response) ^ 2
), keyby = .(c1, c2)]
rr
#        c1 c2 lon lat      name        mean        sd     statx
#     1:  1  1   1   1 something  0.23841637 0.9384408 0.3253456
#     2:  1  1   1   1 something  0.23841637 0.9384408 0.2421654
#     3:  1  1   1   1 something  0.23841637 0.9384408 0.5321797
#     4:  1  1   1   1 something  0.23841637 0.9384408 0.4136648
#     5:  1  1   1   1 something  0.23841637 0.9384408 1.5863249
# ---                                                        
# 14996: 50 30  50  30 something -0.04082032 0.7156352 2.3970053
# 14997: 50 30  50  30 something -0.04082032 0.7156352 0.8375551
# 14998: 50 30  50  30 something -0.04082032 0.7156352 1.7826972
# 14999: 50 30  50  30 something -0.04082032 0.7156352 1.0293926
# 15000: 50 30  50  30 something -0.04082032 0.7156352 0.1376940
minem
  • Hi @minem, thanks for the reply. This is a nice solution, and definitely faster than the dplyr or plyr approaches, and spits out the data in the exact same format, which is great. However, your example doesn't allow me to 'send' the entire data as an argument to `customFunction` -- this is necessary because this function performs checks of data quality (the 'if' statement) before applying the nested function `someStatsFunction`. In reality, I have a much more complex wrapper function (like `customFunction`) that depends on a series of nested functions to spit out the output data.frame. – dbarneche Aug 10 '18 at 12:13
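
If the whole per-group chunk of data has to go through `customFunction` (e.g. for the quality checks mentioned in the comment above), one possible workaround, sketched here under the assumption that only the `response` and `time` columns are needed, is to pass each group's subset to the function via `.SD`:

library(data.table)

dt  <- as.data.table(myData)

# .SD holds the subset of data for each group; .SDcols restricts it to the
# columns customFunction needs, and ordering by time in i means the ordering
# check inside customFunction still passes
res <- dt[order(time), customFunction(.SD), by = .(c1, c2),
          .SDcols = c('response', 'time')]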

Thanks to the suggestion provided by @chinsoon12 in the comments above, I was able to get my desired outcome:

library(plyr)  
library(dplyr)  
library(data.table)  
library(rbenchmark)  

someStatsFunction  <-  function (y, x) {
    x    <-  as.integer(x)
    mod  <-  coef(summary(lm(y ~ x)))
    data.frame(stats1  = 'something',
             intercept = mod[1],
             slope     = mod[2],
             meanx     = mean(x),
             statx     = sqrt(mean(abs(x)))/sd(y)^2)
}

customFunction  <-  function (data) {
    if (!all(sort(data$time) == data$time)) {
        stop('Column \'time\' is not ordered')
    }
    someStatsFunction(y = data$response, x = data$time)
}

myData  <-  data.frame(c1 = rep(rep(1:50, each = 30), 1095), c2 = rep(rep(1:30, 50), 1095), response = rnorm(30 * 50 * 1095), time = rep(seq(as.Date('1981-01-01'), as.Date('1983-12-31'), by = '1 day'), each = 50*30))

benchmark('testPlyr' = {
            testPlyr   <-  plyr::ddply(myData, .(c1, c2), customFunction)
        },
          'testDplyr' = {
            testDplyr  <-  myData %>% dplyr::group_by(c1,c2) %>% dplyr::do(customFunction(.))
        },
          'testDtb' = {
            testDtb  <-  setDT(myData)[order(time), someStatsFunction(response, time), by=.(c1, c2)]
        },
    replications = 3,
    columns      = c('test', 'replications', 'elapsed', 'relative', 'user.self', 'sys.self'))

Here's what I get for the benchmarking:

       test replications elapsed relative user.self sys.self
2 testDplyr            3  68.383    3.976    48.120   20.392
3   testDtb            3  17.201    1.000    17.232    0.008
1  testPlyr            3  57.938    3.368    49.676    8.304

In case you're interested in knowing whether the results are identical among the different methods, check:

all.equal(testDplyr, testDtb)
# [1] TRUE
all.equal(testDplyr, testPlyr)
# [1] TRUE
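
As a side note on the final call: `by` returns groups in their order of first appearance (which in this example already happens to be sorted by c1 and c2), whereas `keyby` would additionally sort the result by the grouping columns and set them as the key. A minimal sketch, assuming the same objects as above:

testDtb <- setDT(myData)[order(time), someStatsFunction(response, time), keyby = .(c1, c2)]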
dbarneche