
Hi, I am trying to use `ddply` from the plyr library in R, together with the doMC package. It doesn't seem to be speeding up the computation. This is the code I run:

require(plyr)
require(doMC)
registerDoMC(4)
getDoParWorkers()
##> 4
test <- data.frame(x=1:10000, y=rep(c(1:20), 500))
system.time(ddply(test, "y", mean))
  # user  system elapsed 
  # 0.015   0.000   0.015
system.time(ddply(test, "y", mean, .parallel=TRUE))
  # user  system elapsed 
  # 223.062   2.825   1.093 

Any ideas?

Alex
  • Depending on the actual calculations you're performing, the `data.table` package might really speed them up. For all its virtues, the 'split-apply-combine' implementation in the `plyr` package is actually fairly slow, whereas `data.table` is first and foremost designed for speed. (If you're intrigued, just search SO for something like `[r] [data.table] plyr` to get a lot of possible starting points.) – Josh O'Brien Mar 21 '12 at 17:00
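For reference, a minimal sketch of what the `data.table` version of the grouped mean in the question might look like (it assumes the same `test` data frame; the `mean_x` column name is just illustrative):

library(data.table)

# Compute the grouped mean with data.table's by-group syntax
testDT <- as.data.table(test)
testDT[, .(mean_x = mean(x)), by = y]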

2 Answers


The mean function operates too quickly relative to the communication costs required to distribute the split sections to each core and retrieve the results.

This is a common "problem" people run into with distributed computing. They expect it to make everything run faster because they forget there are costs (communication between the nodes) as well as benefits (using multiple cores).

Something specific to parallel processing in plyr: only the function is run on multiple cores. The splitting and combining is still done on a single core, so the function you're applying would have to be very computationally intensive to see a benefit when using plyr functions in parallel.
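To illustrate the point, a rough sketch comparing a cheap per-group function with an artificially expensive one; `slow_mean` below is a hypothetical helper, not anything from plyr, and exists only to simulate a heavy per-group computation:

library(plyr)
library(doMC)
registerDoMC(4)

test <- data.frame(x = 1:10000, y = rep(c(1:20), 500))

# Hypothetical helper: the busy-work loop stands in for a genuinely
# expensive per-group computation
slow_mean <- function(d) {
  for (i in 1:2000) invisible(sqrt(d$x))
  mean(d$x)
}

# Cheap function: communication overhead dominates, so parallel is likely slower
system.time(ddply(test, "y", function(d) mean(d$x)))
system.time(ddply(test, "y", function(d) mean(d$x), .parallel = TRUE))

# Expensive function: the per-group work can outweigh the overhead,
# so the parallel run has a chance to come out ahead
system.time(ddply(test, "y", slow_mean))
system.time(ddply(test, "y", slow_mean, .parallel = TRUE))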

Joshua Ulrich
  • This is just an example. The real data frame I have is 4 million rows with 2000 groups. The code runs in the same amount of time with and without `.parallel=TRUE`. – Alex Mar 21 '12 at 16:23
  • @Alex: If your data.frame is huge, then your function will need to be all the more computationally intensive because most of the time spent in `ddply` is going to be splitting and combining, not applying the function. – Joshua Ulrich Mar 21 '12 at 16:24
  • I see, so the part it distributes is the actual application of the function, and the splitting is still done on one core? – Alex Mar 21 '12 at 16:30
  • 1
  • @Alex: Not just the splitting, but also the combining. – Joshua Ulrich Mar 21 '12 at 16:35
  • That makes perfect sense logically. It's unfortunate, since the splitting and combining is the time-intensive part. Thanks! – Alex Mar 21 '12 at 16:45

Continuing from Joshua's answer: there is a workaround if you want to speed up this operation. It is inspired by the MapReduce approach; I did a proof of concept on a sample dataset a while back.

I used the snowfall library; I believe you can make this work with doMC as well.

# On my phone, please pardon typos/bugs

library(snowfall)
library(plyr)

test <- data.frame(x = 1:1000000, y = rep(c(1:20), 500))

# Split the data into equal chunks - need to find the optimum number of splits
testList <- list()
testList[[1]] <- test[c(1:250000), ]
testList[[2]] <- test[c(250001:500000), ]
testList[[3]] <- test[c(500001:750000), ]
testList[[4]] <- test[c(750001:1000000), ]

sfInit(parallel = TRUE, cpus = 4)
sfLibrary(plyr)  # make plyr available on each worker

# Summarise each chunk on its own worker (load-balanced)
meanList <- sfClusterApplyLB(testList,
                             function(d) ddply(d, "y", summarise, x = mean(x)))

sfStop()

# Stack the per-chunk results and average the chunk means per group
combined <- do.call(rbind, meanList)
aggregate(x ~ y, data = combined, FUN = mean)

This might help, given that we are now doing the split-combine routine in a distributed fashion. Averaging the per-chunk results gives the correct answer for means only when the splits are the same size; sums, min/max, counts and the like combine cleanly, but there are some operations this approach cannot be used for.
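Since the question already loads doMC, the same chunked pattern can also be sketched with `foreach` instead of snowfall (a sketch only, reusing the `testList` chunks built above; the equal-sized-splits caveat applies here too):

library(doMC)
library(foreach)
library(plyr)
registerDoMC(4)

# Summarise each chunk on its own core, then stack the per-chunk results
# and average the chunk means per group
chunkMeans <- foreach(d = testList, .combine = rbind, .packages = "plyr") %dopar%
  ddply(d, "y", summarise, x = mean(x))

aggregate(x ~ y, data = chunkMeans, FUN = mean)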

jackStinger