
We have a very large data frame df that can be split by factors. On each subset of the data frame created by this split, we need to perform an operation that increases the number of rows of that subset until it reaches a certain length. Afterwards, we rbind the subsets to get a bigger version of df.

Is there a way of doing this quickly without using an inner function?

Let's say our subset operation (in a separate .R file) is:

foo <- function(df) { magic }

We've come up with a few ways of doing this:

1)

df <- split(df, factor)
df <- lapply(df, foo)
rbindlist(df)

2)

assign('list.df', list(), envir=.GlobalEnv) 
assign('i', 1, envir=.GlobalEnv)

df <- dplyr::group_by(df, factor)
dplyr::mutate(df, foo.list(df.col))
df <- rbindlist(list.df)
rm('list.df', envir=.GlobalEnv)
rm('i', envir=.GlobalEnv)

(In a separate file)
foo.list <- function(df.cols) {
    magic; 
    list.df[[i]] <<- magic.df
    i <<- i + 1
    return(dummy)
}

The issue with the first approach is time. The lapply simply takes too long to be practical (on the order of an hour with our data set).

The issue with the second approach is the extremely undesirable side-effect of tampering with the user's global environment. It's significantly faster, but this is something we'd rather avoid if we can.

We've also tried passing in the list and count variables and then trying to substitute them with the variables in the parent environment (a sort of hack to get around R's lack of pass-by-reference).
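
For concreteness, here is a minimal sketch of that kind of workaround using an environment as a mutable container (environments are the closest R gets to pass-by-reference). The names collect and foo.collect are hypothetical, and the "magic" is stubbed out:

collect <- new.env()                 # mutable container, nothing touches .GlobalEnv
collect$list.df <- list()
collect$i <- 1L

foo.collect <- function(df.subset, env) {
  magic.df <- df.subset              # stand-in for the real "magic"
  env$list.df[[env$i]] <- magic.df   # modifies the environment in place
  env$i <- env$i + 1L
  invisible(NULL)
}

## after applying foo.collect to every subset:
## df <- rbindlist(collect$list.df)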

We've looked at a number of possibly-relevant SO questions (R applying a function to a subset of a data frame, Calculations on subsets of a data frame, R: Pass by reference, etc.) but none of them addressed our question well.

If you want to run code, here's something you can copy and paste:

 x <- runif(n=10, min=0, max=3)
 y <- sample(x=10, replace=FALSE)
 factors <- runif(n=10, min=0, max=2)
 factors <- floor(factors)
 df <- data.frame(factors, x, y)

df now looks like this (10 rows): [image: Original df]

 ## We group by factor, then run foo on the groups.

 foo <- function(df.subset) {
   min <- min(df.subset$y)
   max <- max(df.subset$y)

   ## We fill out df.subset to have everything between the min and
   ## max values of y. Then we assign the old values of df.subset
   ## to the corresponding spots.

   df.fill <- data.frame(x=rep(0, max-min+1),
                         y=min:max,
                         factors=rep(df.subset$factors[1], max-min+1))
   df.fill$x[which(df.subset$y %in%(min:max))] <- df.subset$x
   df.fill
 }

So I can take my sample code in the first approach to build a new df (18 rows): [image: New df]
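
For completeness, this is roughly what the first approach looks like on the example above (a sketch; rbindlist comes from data.table, and the names subsets, filled, and df.new are only used here):

library(data.table)
subsets <- split(df, df$factors)   # one data frame per factor level
filled  <- lapply(subsets, foo)    # fill each subset out to its full y range
df.new  <- rbindlist(filled)       # recombine; 18 rows in the run shown above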

Ryan K.
  • You may want to look at `expand.grid` to fill out all values. – Gopala Jan 28 '16 at 14:54
  • @Henrik, thanks for the advice! I'll make adjustments accordingly. – Ryan K. Jan 28 '16 at 15:03
  • For expand.grid, we definitely considered it, but we had trouble getting it to work with the groupings, as every group has a different min and max we need to fill to. – Ryan K. Jan 28 '16 at 15:07
  • @Henrik, sorry for not being clear. Foo is (we believe) as fast as reasonably possible. Our main concern is just applying foo to subgroups of ``df`` and then re-aggregating. Thanks though! I've improved the question based on your comments. – Ryan K. Jan 28 '16 at 15:19

4 Answers


Using data.table, this doesn't take long thanks to its fast grouped operations. If you can, rewrite your function to work with specific variables; the split-apply-combine processing may get a performance boost:

library(data.table)
system.time(
df2 <- setDT(df)[,foo(df), factors]
)
#   user  system elapsed 
#   1.63    0.39    2.03
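
As a sketch of what "working with specific variables" could look like here (the column names are taken from the question; the exact rewrite is an assumption), the filled-out rows can be computed directly in j for each group instead of building a data.frame inside foo:

library(data.table)
setDT(df)
df2 <- df[, {
  ys <- min(y):max(y)             # full y range for this group
  xs <- numeric(length(ys))       # x defaults to 0
  xs[match(y, ys)] <- x           # drop the known x values into place
  .(x = xs, y = ys)
}, by = factors]
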
Pierre L

Another variation using data.table: first get the min(y):max(y) part and then join + update:

require(data.table)
ans = setDT(df)[, .(x=0, y=min(y):max(y)), by=factors
              ][df, x := i.x, on=c("factors", "y")][]
ans
#     factors          x  y
#  1:       0 1.25104362  1
#  2:       0 0.16729068  2
#  3:       0 0.00000000  3
#  4:       0 0.02533907  4
#  5:       0 0.00000000  5
#  6:       0 0.00000000  6
#  7:       0 1.80547980  7
#  8:       1 0.34043937  3
#  9:       1 0.00000000  4
# 10:       1 1.51742163  5
# 11:       1 0.15709287  6
# 12:       1 0.00000000  7
# 13:       1 1.26282241  8
# 14:       1 2.88292354  9
# 15:       1 1.78573288 10
Arun
  • The `range` function may(?) be faster than `min` and `max`. So having `seq2 = function(x) seq(x[1], x[2])` we can use `j = .(x=0, y=seq2(range(y)))`. – jangorecki Jan 28 '16 at 20:01
  • @jangorecki, maybe.. not sure. Good to know though, thanks. – Arun Jan 29 '16 at 17:24

Pierre and Roland have already provided nice solutions.
If the concern is scalability not only in timing but also in memory, you can spread the data across a number of remote R instances.
In the most basic setup it requires only Rserve/RSclient, so there are no non-CRAN dependencies.

Spread data across R instances

For easier reproducibility, the example below starts two R instances on a single localhost machine. For real scalability you need to start Rserve nodes on remote machines.

# start R nodes
library(Rserve)
port = 6311:6312
invisible(sapply(port, function(port) Rserve(debug = FALSE, port = port, args = c("--no-save"))))

# populate data
set.seed(123)
x = runif(n=5e6,min=0, max=3)
y = sample(x=5e6,replace=FALSE)
factors = runif(n=5e6, min=0, max=2)
factors = floor(factors)
df = data.frame(factors, x, y)

# connect Rserve nodes
library(RSclient)
rscl = sapply(port, function(port) RS.connect(port = port))

# assign chunks to R nodes
sapply(seq_along(rscl), function(i) RS.assign(rscl[[i]], name = "x", value = df[df$factors == (i-1),]))

# assign magic function to R nodes
foo = function(df) df
sapply(rscl, RS.assign, name = "foo", value = foo)

All processing on the remote machines can be performed in parallel (using wait=FALSE and RS.collect), which further reduces computing time.


Using lapply + RS.eval

# sequentially
l = lapply(rscl, RS.eval, foo(x))
rbindlist(l)

# parallely
invisible(sapply(rscl, RS.eval, foo(x), wait=FALSE))
l = lapply(rscl, RS.collect)
rbindlist(l)

Using big.data.table::rscl.*

The big.data.table package provides a few wrappers around RSclient::RS.* functions, allowing them to accept a list of connections to R nodes.
They don't use data.table in any way, so they can be applied to a data.frame, a vector, or any R type that is chunk-able. The example below uses a plain data.frame.

library(big.data.table)

# sequentially
l = rscl.eval(rscl, foo(x), simplify=FALSE)
rbindlist(l)

# parallely
invisible(rscl.eval(rscl, foo(x), wait=FALSE))
l = rscl.collect(rscl, simplify=FALSE)
rbindlist(l)

Using big.data.table

This example requires the data on the nodes to be stored as data.tables, but it gives a convenient API and a lot of other features.

library(big.data.table)
rscl.require(rscl, "data.table")
rscl.eval(rscl, is.data.table(setDT(x))) # is.data.table to suppress collection of `setDT` results

bdt = big.data.table(rscl = rscl)
# runs in parallel by default
bdt[, foo(.SD), factors]
# since the data is partitioned by the `factors` field, `by` is redundant in this case
bdt[, foo(.SD)]
# optionally use `[[` to access R nodes environment directly
bdt[[expr = foo(x)]]

Clean workspace

# disconnect
rscl.close(rscl)

# shutdown nodes started from R
l = lapply(setNames(nm = port), function(port) tryCatch(RSconnect(port = port), error = function(e) e, warning = function(w) w))
invisible(lapply(l, function(rsc) if(inherits(rsc, "sockconn")) RSshutdown(rsc)))
jangorecki

I don't think your function works as intended. It relies on y being ordered.
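
For illustration, a sketch of an order-independent variant of the question's function (foo2 is just a name for this sketch); it matches each x to the row whose y agrees instead of assuming the rows line up:

foo2 <- function(df.subset) {
  min <- min(df.subset$y)
  max <- max(df.subset$y)
  df.fill <- data.frame(x=rep(0, max-min+1),
                        y=min:max,
                        factors=rep(df.subset$factors[1], max-min+1))
  ## match() finds, for each y in df.subset, its row in df.fill, in any order
  df.fill$x[match(df.subset$y, df.fill$y)] <- df.subset$x
  df.fill
}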

Try using a data.table join with grouping:

library(data.table)
setDT(df)
df2 <- df[, .SD[data.table(y=seq(.SD[, min(y)], .SD[, max(y)], by = 1)), .SD, 
                  on = "y"], #data.table join
                    by = factors] #grouping
df2[is.na(x), x:= 0]
setkey(df2, factors, y, x)
Roland
  • I believe the example was just a small mock-up of the function and not the function itself. – Pierre L Jan 28 '16 at 15:22
  • @PierreLafortune Maximum performance can only be achieved with a tailored solution. It doesn't matter much how the split-apply-combine is done if the function is slow. – Roland Jan 28 '16 at 15:25
  • I agree with that. The function to be applied should be rewritten. – Pierre L Jan 28 '16 at 15:31
  • @Henrik Looking at OP's code, I don't believe their claims regarding efficiency of their functions. – Roland Jan 28 '16 at 15:33