
I'm trying to find out which statistical/data mining algorithms in R, or in R packages on CRAN/GitHub/R-Forge, can handle large datasets: either in parallel on one server, sequentially without running into out-of-memory issues, or across several machines at once. The goal is to evaluate whether I can easily port them to work with ff/ffbase, like ffbase::bigglm.ffdf.
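
To make concrete the kind of out-of-memory fitting I mean, here is a minimal sketch using bigglm on an ffdf (the data, formula and chunksize are just placeholders):

library(ff)
library(ffbase)
library(biglm)
# toy data converted to an on-disk ffdf; real data would be read with e.g. read.csv.ffdf
dat <- as.ffdf(data.frame(y = rnorm(1e5), x1 = rnorm(1e5), x2 = rnorm(1e5)))
# bigglm.ffdf feeds the model rows in chunks, so the full dataset never has to fit in RAM
fit <- bigglm(y ~ x1 + x2, data = dat, family = gaussian(), chunksize = 10000)
summary(fit)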

I would like to split these up into 3 parts:

  1. Algorithms that update or work on parameter estimates in parallel

  2. Algorithms that work sequentially (the data is brought into R, but only one process is used and only that process updates the parameters)

  3. Algorithms that work on part of the data

I would like to exclude simple parallelisation, such as optimising over a hyperparameter by e.g. cross-validating. Any other pointers to these kinds of models/optimisers or algorithms? Maybe Bayesian methods? Maybe a package like RGraphlab (http://graphlab.org/)?

  • nobody ever got fired for using hadoop – lynks Nov 26 '12 at 17:23
  • not sure how monetdb handles multiple processors, but it certainly works fast on big data and is worth a look :) -- http://usgsd.blogspot.com/2012/11/why-and-how-to-install-monetdb-with-r.html – Anthony Damico Nov 26 '12 at 18:11
  • Thanks, but I'm looking for an algorithm rather than a data store. –  Nov 27 '12 at 08:17
  • related : http://stackoverflow.com/questions/8342986/big-data-process-and-analysis-in-r, http://stackoverflow.com/questions/1727772/quickly-reading-very-large-tables-as-dataframes-in-r, http://stackoverflow.com/questions/5527850/how-much-data-can-r-handle – Joris Meys Nov 27 '12 at 15:45
  • Thanks Joris, I'm aware of all of these, but I'm looking for a parallelised algorithm: not tricks for handling big data, but statistical models that are parallel at their core. –  Nov 27 '12 at 18:49

2 Answers


Have you read through the High Performance Computing Task View on CRAN?

It covers many of the points you mention and gives overviews of packages in those areas.

Greg Snow
  • Yes, I have, but my impression is that it focuses more on parallelising the optimisation of hyperparameters over several cores than on parallelising the algorithm itself. I am looking more for algorithms which work in parallel on the same parameter estimates. –  Nov 26 '12 at 20:11
  • @jwijffels, I expect the maintainer of the Task View would appreciate anything new that you learn and would add it to the task view. – Greg Snow Nov 26 '12 at 20:15
  • OK, will do. For info: I am looking for these kinds of things in R (e.g. pages.cs.wisc.edu/~brecht/papers/hogwildTR.pdf), not necessarily related to collaborative filtering but to other algorithms as well. –  Nov 26 '12 at 20:32

Random forests are trivial to run in parallel. It's one of the examples in the foreach vignette:

library(randomForest)
library(foreach)
library(doParallel)            # or any other foreach parallel backend
registerDoParallel(cores = 4)  # register a backend so %dopar% actually runs in parallel
x <- matrix(runif(500), 100)
y <- gl(2, 50)
# grow four forests of 250 trees each in parallel, then merge them with randomForest::combine
rf <- foreach(ntree = rep(250, 4), .combine = combine,
              .packages = 'randomForest') %dopar%
  randomForest(x, y, ntree = ntree)

You can use this construct to split your forest over every core in your cluster.
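
For reference, a sketch of the same loop with a PSOCK cluster spanning several machines (the hostnames are placeholders for your own nodes; x and y are as defined above):

library(randomForest)
library(doParallel)
cl <- makeCluster(c("node1", "node2"), type = "PSOCK")  # replace with your hostnames
registerDoParallel(cl)
rf <- foreach(ntree = rep(250, 4), .combine = combine,
              .packages = 'randomForest') %dopar%
  randomForest(x, y, ntree = ntree)
stopCluster(cl)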

Zach
  • Thanks, great example. I didn't know about the combine function in the randomForest package. Random forests are indeed perfectly suited for this. For me this falls under 'Algorithms that update or work on parameter estimates in parallel'. Any other suggestions? –  Nov 27 '12 at 18:46
  • DEoptim can also be run in parallel, and since it's a general-purpose optimizer, can be used for a variety of algorithms. – Zach Nov 27 '12 at 19:28
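
A rough sketch of what that could look like, assuming the parallelType option of DEoptim.control (the objective here is just a toy 2-D Rosenbrock function; check the option's accepted values for your DEoptim version):

library(DEoptim)
# toy objective: the 2-D Rosenbrock function
rosenbrock <- function(p) (1 - p[1])^2 + 100 * (p[2] - p[1]^2)^2
# parallelType = "parallel" asks DEoptim to evaluate the population via the parallel package
out <- DEoptim(rosenbrock, lower = c(-5, -5), upper = c(5, 5),
               control = DEoptim.control(itermax = 200, parallelType = "parallel"))
out$optim$bestmem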