The baseball data set may be too small to show any improvement from making the computations parallel; the overhead of passing the data to the worker processes may swamp whatever speedup the parallel calculations provide. Using the rbenchmark
package:
library(plyr)
library(rbenchmark)
library(doMC)      # .parallel = TRUE needs a registered foreach backend; doMC is one option
registerDoMC(2)    # use two cores
# Stack ten copies of the baseball data to get a larger test set
baseball10 <- baseball[rep(seq_len(nrow(baseball)), 10), ]
benchmark(noparallel   = ddply(baseball, .(year), numcolwise(mean)),
          parallel     = ddply(baseball, .(year), numcolwise(mean), .parallel = TRUE),
          noparallel10 = ddply(baseball10, .(year), numcolwise(mean)),
          parallel10   = ddply(baseball10, .(year), numcolwise(mean), .parallel = TRUE),
          replications = 10)
gives these results:
          test replications elapsed relative user.self sys.self user.child sys.child
1   noparallel           10   4.562 1.000000     4.145    0.408      0.000     0.000
3 noparallel10           10  14.134 3.098203     9.815    4.242      0.000     0.000
2     parallel           10  11.927 2.614423     2.394    1.107      4.836     6.891
4   parallel10           10  18.406 4.034634     4.045    2.580     10.210     9.769
With a data set ten times larger, the relative penalty for running in parallel is smaller. A more computationally expensive per-group calculation would tilt the balance further in the parallel version's favor and would likely give it the advantage.
This was run on a Mac OS X 10.5.8 Core 2 Duo machine.
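To illustrate that last point, here is a minimal sketch of a heavier per-group computation, a bootstrap of the numeric column means within each year. The boot_means helper, the 200 resamples, and the doMC backend with two cores are illustrative assumptions rather than part of the original benchmark, and the timings will depend on the machine and the backend used.

library(plyr)
library(rbenchmark)
library(doMC)
registerDoMC(2)    # assumes two cores are available

# Hypothetical heavier per-group summary: bootstrap the numeric column means
# (boot_means and the 200 resamples are illustrative choices only)
boot_means <- function(piece, R = 200) {
  num <- Filter(is.numeric, piece)
  reps <- replicate(R, colMeans(num[sample(nrow(num), replace = TRUE), , drop = FALSE],
                                na.rm = TRUE))
  as.data.frame(t(rowMeans(reps)))
}

benchmark(serial   = ddply(baseball, .(year), boot_means),
          parallel = ddply(baseball, .(year), boot_means, .parallel = TRUE),
          replications = 10)

Because each year's bootstrap is costly relative to the fixed overhead of shipping its rows to a worker, the parallel version has a better chance of coming out ahead here than it did with a simple per-group mean.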