
Goals

I want to use dplyr to run simulations on grids of parameters. Specifically, I'd like a function that I can use in another program that

  • gets passed a data.frame
  • for every row calculates some simulation using each column as an argument
  • also is passed some extra data (e.g., initial conditions)

Here's my approach

require(dplyr)
run <- function(data, fun, fixed_parameters, ...) {
  ## ....
  ## argument checking
  ##

  fixed_parameters <- as.environment(fixed_parameters)
  # run fun once per row, combining that row's values with the fixed parameters
  grouped_out <- do_(rowwise(data), ~ do.call(fun, c(., fixed_parameters, ...)))
  ungroup(grouped_out)
}

This works. For example, with

growth <- function(n, r, K, b) {
  # some dynamical simulation
  # this is an obviously-inefficient way to do this ;)
  n  + r - exp(n) / K - b - rnorm(1, 0, 0.1)
}
growth_runner <- function(r, K, b, ic, ...) {
  # a wrapper to run the simulation with some fixed values
  n0 = ic$N0
  T = ic$T
  reps = ic$reps
  data.frame(n_final = replicate(reps, {
    for (t in 1:T) {
      n0 <- growth(n0, r, K, b)
    }
    n0
  }))
}

I can define a parameter grid and run:

data <- expand.grid(b = seq(0.01, 0.5, length.out=10),
                    K = exp(seq(0.1, 5, length.out=10)),
                    r = seq(0.5, 3.5, length.out=10))
initial_data <- list(N0=0.9, T=5, reps=20)
output <- run(data, growth_runner, initial_data)

Question

Even though this seems to work, I wonder if there's a way to do it without `do.call`. (In part because of issues with `do.call`.)

I really am interested in a way to replace the line `grouped_out <- do_(rowwise(data), ~ do.call(fun, c(., fixed_parameters, ...)))` with something that does the same thing but without `do.call`. Edit: An approach that somehow avoids the performance penalties of using `do.call` outlined at the above link would also work.

– jaimedash

3 Answers


I found it a little tricky to follow your code, but I think this is equivalent.

First I define a function that does the computation you're interested in:

growth_t <- function(n0, r, K, b, T) {
  n <- n0

  for (t in 1:T) {
    n <- n + r - exp(n) / K - b - rnorm(1, 0, 0.1)
  }
  n
}
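
For instance, a single parameter combination can be evaluated directly (the values here are arbitrary, for illustration only):

growth_t(n0 = 0.9, r = 1, K = 2, b = 0.1, T = 5)
# returns a single simulated end state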

Then I define the data that you want to vary, including a "dummy" variable for reps:

data <- expand.grid(
  b = seq(0.01, 0.5, length.out = 5),
  K = exp(seq(0.1, 5, length.out = 5)),
  r = seq(0.5, 3.5, length.out = 5),
  rep = 1:20
)

Then I can feed it into purrr::pmap_dbl(). pmap_dbl() does a "parallel" map - i.e. it takes a list (or data frame) as input and calls the function once per element (here, once per row), matching the named arguments. The fixed parameters are supplied after the function name.

library(purrr)
data$output <- pmap_dbl(data[1:3], growth_t, n0 = 0.9, T = 5)
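
For a minimal, self-contained illustration of the same pattern (a toy example; params, x, y, and scale are made-up names, not part of the answer's code): columns are matched to the function's arguments by name, and anything supplied after the function is held fixed across rows.

library(purrr)
params <- data.frame(x = 1:3, y = c(10, 20, 30))
# x and y vary row by row; scale is fixed for every call
pmap_dbl(params, function(x, y, scale) (x + y) * scale, scale = 0.5)
#> [1]  5.5 11.0 16.5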

This really doesn't feel like a dplyr problem to me, because it's not really about data manipulation.

– hadley
  • thanks! fair point re dplyr. It started with `dplyr::do`, but given the expanded tooling for tidy data, and especially the direction you're heading with `purrr` (e.g., http://stackoverflow.com/q/35505187/4598520), I agree it's probably better described as a tidy data problem – jaimedash May 25 '16 at 17:54

The approach below avoids do.call and presents the output in the same format as the OP's code.

First, replace the function's separate parameters with a single vector argument - this is what apply() will pass in, one row at a time.

growth_runner <- function(data.in, ic, ...) {
  # a wrapper to run the simulation with some fixed values
  n0 = ic$N0
  T = ic$T
  reps = ic$reps
  # the grid columns are (b, K, r), so data.in[3] = r, data.in[2] = K, data.in[1] = b
  data.frame(n_final = replicate(reps, {
    for (t in 1:T) {
      n0 <- growth(n0, data.in[3], data.in[2], data.in[1])
    }
    n0
  }))
}

Set up the grid you want to search over, just as before.

data <- expand.grid(b = seq(0.01, 0.5, length.out=10),
                    K = exp(seq(0.1, 5, length.out=10)),
                    r = seq(0.5, 3.5, length.out=10))
initial_data = list(N0=0.9, T=5, reps=20)

Use apply() to run the wrapper over each row of the grid, then collect the results:

output.mid = apply(data, 1, ic=initial_data, FUN=growth_runner)
output <- data.frame('n_final'=unlist(output.mid))

And you have your output without any calls to do.call or any external library.

> dim(output)
[1] 20000     1
> head(output)
     n_final
1 -0.6375070
2 -0.7617193
3 -0.3266347
4 -0.7921655
5 -0.5874983
6 -0.4083613
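
One aside on this approach (a sketch, not from the original answer): apply() passes each row to FUN as a named numeric vector, so the positional indices data.in[3], data.in[2], data.in[1] could equivalently be looked up by column name, which is safer if the column order of expand.grid() ever changes. For example:

# look the parameters up by name instead of by position
apply(head(data, 2), 1, function(data.in)
  growth(0.9, data.in[["r"]], data.in[["K"]], data.in[["b"]]))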
– Tchotchke
  • Sorry, you're missing critical context from the question: using dplyr (see the first line of the question). The edit of 5/19 makes this clear. This is useful code, though, to accomplish the same overall task in a less generic way. Thanks! – jaimedash May 24 '16 at 17:37
  • Also note that `apply()` will fail as soon as you have non-numeric parameters – hadley May 25 '16 at 17:32

You can replace the line containing do.call with the following (thanks to @shorpy for pointing out purrr::invoke_rows()):

grouped_out <- purrr::invoke_rows(fun, dplyr::rowwise(data), fixed_parameters)

Without any other changes, this will give a data frame with a column of data frames, like

Source: local data frame [1,000 x 4]
            b        K     r                .out
        (dbl)    (dbl) (dbl)               (chr)
1  0.01000000 1.105171   0.5 <data.frame [20,1]>
2  0.06444444 1.105171   0.5 <data.frame [20,1]>
3  0.11888889 1.105171   0.5 <data.frame [20,1]>

To recover something closer to the original behavior, replace the final line of run with

dplyr::ungroup(tidyr::unnest(grouped_out, .out))

which gives

Source: local data frame [20,000 x 4]

       b        K     r    n_final
   (dbl)    (dbl) (dbl)      (dbl)
1   0.01 1.105171   0.5 -0.6745470
2   0.01 1.105171   0.5 -0.7500365
3   0.01 1.105171   0.5 -0.6568312

No other changes to the code are needed :)
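
Putting the two replacements together, run() ends up looking something like the sketch below (my consolidation of the edits above; the rest of the body is unchanged from the question):

run <- function(data, fun, fixed_parameters, ...) {
  ## ....
  ## argument checking
  ##

  fixed_parameters <- as.environment(fixed_parameters)
  grouped_out <- purrr::invoke_rows(fun, dplyr::rowwise(data), fixed_parameters)
  dplyr::ungroup(tidyr::unnest(grouped_out, .out))
}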

– jaimedash