I'm trying to find out which statistical/data mining algorithms exist in R, or in R packages on CRAN/GitHub/R-Forge, that can handle large datasets, either in parallel on one server, sequentially without running into out-of-memory issues, or across several machines at once. The goal is to evaluate whether I can easily port them to work with ff/ffbase, the way ffbase::bigglm.ffdf does.
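For reference, this is the kind of chunked usage I mean; a minimal sketch, assuming a hypothetical file big.csv with columns y, x1 and x2 (file name, formula and chunksize are illustrative):

```r
library(ffbase)  # loads ff as well; bigglm.ffdf wraps biglm::bigglm

# read.csv.ffdf keeps the data on disk in an ffdf instead of in RAM
dat <- read.csv.ffdf(file = "big.csv", header = TRUE)

# The model is fitted chunk by chunk: only `chunksize` rows are held
# in memory at a time, so RAM use stays flat regardless of nrow(dat)
fit <- bigglm.ffdf(y ~ x1 + x2, data = dat, family = binomial(),
                   chunksize = 10000)
summary(fit)
```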
I would like to split these up into 3 parts:

1. Algorithms that update or work on parameter estimates in parallel (a sketch of what I mean follows this list):
   - Buckshot (https://github.com/lianos/buckshot)
   - lm.fit @ Programming with Big Data in R (https://github.com/RBigData)
2. Algorithms that work sequentially (the data are brought into R chunk by chunk, but only one process runs and only that process updates the parameter estimates; a second sketch follows this list):
   - bigglm (http://cran.r-project.org/web/packages/biglm/index.html)
   - Compound Poisson linear models (http://cran.r-project.org/web/packages/cplm/index.html)
   - bigkmeans @ biganalytics (http://cran.r-project.org/web/packages/biganalytics/index.html)
3. Algorithms that work on part of the data:
   - Distributed text processing (http://www.jstatsoft.org/v51/i05/paper)
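To make the first category concrete, here is a minimal sketch (my own illustration, not code from the packages above) of a least-squares fit whose parameter estimates are assembled in parallel: X'X and X'y are sums over row chunks, so each worker computes its chunk's contribution and the pieces are combined at the end.

```r
library(parallel)

# Each worker computes the crossproducts for its own chunk of rows;
# summing them gives the full X'X and X'y, from which the coefficients
# follow. `chunks` is a list of row-index vectors (hypothetical helper).
par_lm <- function(X, y, chunks, cores = 2) {
  stats <- mclapply(chunks, function(idx) {  # mc.cores > 1 needs Unix
    Xi <- X[idx, , drop = FALSE]
    list(xtx = crossprod(Xi), xty = crossprod(Xi, y[idx]))
  }, mc.cores = cores)
  xtx <- Reduce(`+`, lapply(stats, `[[`, "xtx"))
  xty <- Reduce(`+`, lapply(stats, `[[`, "xty"))
  drop(solve(xtx, xty))  # coefficient estimates
}

# Toy usage: 1000 rows split into 4 chunks of 250
X <- cbind(1, matrix(rnorm(2000), ncol = 2))
y <- drop(X %*% c(1, 2, -1)) + rnorm(1000)
chunks <- split(seq_len(1000), rep(1:4, each = 250))
par_lm(X, y, chunks)
```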
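And for the second category, the pattern I have in mind, sketched with biglm's update() mechanism on toy data (the chunks here are made up): the model object carries only the running sufficient statistics, so a single process can fold in one chunk at a time while memory use stays flat.

```r
library(biglm)

# Two toy chunks standing in for pieces of a dataset too big for RAM
chunk1 <- data.frame(y = rnorm(100), x = rnorm(100))
chunk2 <- data.frame(y = rnorm(100), x = rnorm(100))

m <- biglm(y ~ x, data = chunk1)  # fit on the first chunk
m <- update(m, chunk2)            # fold in the next chunk sequentially
summary(m)
```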
And I would like to exclude simple parallelisation, like optimising over a hyperparameter by e.g. cross-validating. Any other pointers to these kinds of models/optimisers or algorithms? Maybe Bayesian ones? Maybe a package called RGraphlab (http://graphlab.org/)?