42

It seems like R is really designed to handle datasets that it can pull entirely into memory. What R packages are recommended for signal processing and machine learning on very large datasets that cannot be pulled into memory?

If R is simply the wrong way to do this, I am open to other robust free suggestions (e.g. scipy, if there is some nice way to handle very large datasets).

    Have a look at the "Large memory and out-of-memory data" subsection of the [high performance computing task view](http://cran.r-project.org/web/views/HighPerformanceComputing.html) on CRAN. [bigmemory](http://cran.r-project.org/web/packages/bigmemory/index.html) and [ff](http://cran.r-project.org/web/packages/ff/index.html) are two popular packages. Also, consider storing data in a database and reading in smaller batches for analysis. – jthetzel Jun 15 '12 at 17:54

5 Answers

32

Have a look at the "Large memory and out-of-memory data" subsection of the high performance computing task view on CRAN. bigmemory and ff are two popular packages. For bigmemory (and the related biganalytics and bigtabulate), the bigmemory website has a few very good presentations, vignettes, and overviews from Jay Emerson. For ff, I recommend reading Adler, Oehlschlägel, and colleagues' excellent slide presentations on the ff website.
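
To make the file-backed workflow concrete, here is a minimal sketch using bigmemory and biganalytics; the CSV name, the backing-file names, and the columns y, x1, x2 are hypothetical placeholders, not anything from the original answer.

```r
# Minimal sketch of a file-backed workflow with bigmemory/biganalytics.
# "data.csv", the backing files, and the columns y, x1, x2 are hypothetical.
library(bigmemory)
library(biganalytics)

# Parse the CSV once into a file-backed big.matrix (stored on disk, not in RAM)
x <- read.big.matrix("data.csv", header = TRUE, type = "double",
                     backingfile = "data.bin", descriptorfile = "data.desc")

# Later sessions can re-attach to the backing file without re-reading the CSV
x <- attach.big.matrix("data.desc")

# biganalytics provides summaries and models that operate on big.matrix objects
colmean(x)
fit <- biglm.big.matrix(y ~ x1 + x2, data = x)
summary(fit)
```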

Also, consider storing data in a database and reading it in smaller batches for analysis. There are likely any number of approaches to consider. To get started, consider looking through some of the examples in the biglm package, as well as this presentation from Thomas Lumley.
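
As a sketch of the database route, assuming DBI with RSQLite; the database file, table name, columns, and batch size below are made up for illustration.

```r
# Sketch: stream a large table out of a database in fixed-size batches.
# The SQLite file, table name, and columns are hypothetical placeholders.
library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "bigdata.sqlite")
res <- dbSendQuery(con, "SELECT y, x1, x2 FROM measurements")

while (!dbHasCompleted(res)) {
  chunk <- dbFetch(res, n = 100000)  # only this batch is held in memory
  # ... update running summaries or an incremental model with `chunk` here ...
}

dbClearResult(res)
dbDisconnect(con)
```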

And do investigate the other packages on the high-performance computing task view and those mentioned in the other answers. The packages I mention above are simply the ones I happen to have more experience with.

jthetzel
    But with ff, bigmemory or databases... can you perform any operation offered by R or any package directly? Or can you only run the functions that ff, bigmemory or the database engine have implemented (without needing to break the data into small pieces)? For example, I want to run a regression on a 50 GB numeric file or calculate the median. Or I want to apply DBSCAN, or just create another vector where each element is expressed as some operation on the old ones, BB[i]=AA[i]*AA[i-1]+AA[i-2]. Can I do this with R and ff, bigmemory or any database connector? – skan Mar 14 '15 at 21:13
8

I think the amount of data you can process is more limited by one's programming skills than by anything else. Although a lot of standard functionality is focused on in-memory analysis, cutting your data into chunks already helps a lot. Of course, this takes more time to program than picking up standard R code, but it is often quite possible.

Cutting up data can, for example, be done using read.table or readBin, which support reading only a subset of the data. Alternatively, you can take a look at the high performance computing task view for packages that deliver out-of-memory functionality out of the box. You could also put your data in a database. For spatial raster data, the excellent raster package provides out-of-memory analysis.
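
For illustration, a rough sketch of chunked reading with read.table over a connection; the file name, column layout, and the per-chunk computation are hypothetical.

```r
# Sketch: process a large text file in chunks via a connection and read.table.
# "huge.txt", its single numeric column, and the running sum are hypothetical.
con <- file("huge.txt", open = "r")
chunk_size <- 100000
total <- 0

repeat {
  chunk <- tryCatch(
    read.table(con, nrows = chunk_size, header = FALSE, colClasses = "numeric"),
    error = function(e) NULL  # read.table errors once the connection is exhausted
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  total <- total + sum(chunk[[1]])  # replace with your own per-chunk analysis
}

close(con)
total
```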

Paul Hiemstra
8

For machine learning tasks I can recommend the biglm package, used to do "Regression for data too large to fit in memory". For using R with really big data, one can use Hadoop as a backend and then use the rmr package to perform statistical (or other) analysis via MapReduce on a Hadoop cluster.
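
A rough sketch of how biglm can be fed chunk by chunk via its update() method; "huge.csv" and the columns y, x1, x2 are hypothetical placeholders.

```r
# Sketch: fit a regression with biglm, feeding the data in chunks.
# "huge.csv" and the columns y, x1, x2 are hypothetical placeholders.
library(biglm)

chunk_size <- 50000
con <- file("huge.csv", open = "r")

# First chunk: read the header and initialise the model
chunk <- read.csv(con, nrows = chunk_size)
cols  <- names(chunk)
fit   <- biglm(y ~ x1 + x2, data = chunk)

# Remaining chunks: the header is gone, so supply the column names explicitly
repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = chunk_size, header = FALSE, col.names = cols),
    error = function(e) NULL
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  fit <- update(fit, chunk)  # biglm updates the coefficients incrementally
}

close(con)
summary(fit)
```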

Grega Kešpret
7

It all depends on the algorithms you need. If they can be translated into an incremental form (where only a small part of the data is needed at any given moment, e.g. for Naive Bayes you can hold in memory only the model itself and the current observation being processed), then the best suggestion is to perform machine learning incrementally, reading new batches of data from disk.
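
As a sketch of that incremental idea, the count tables of a categorical Naive Bayes model can be updated chunk by chunk while only the counts stay in memory. The file, its feature columns, and the "label" column below are hypothetical, and later chunks are assumed not to introduce unseen classes or feature levels.

```r
# Sketch: build Naive Bayes count tables incrementally from a labelled CSV.
# "labelled.csv", its categorical features, and the "label" column are
# hypothetical; only the count tables (the model) are ever held in memory.
chunk_size <- 100000
con   <- file("labelled.csv", open = "r")
chunk <- read.csv(con, nrows = chunk_size, colClasses = "character")
cols  <- names(chunk)

class_counts   <- table(chunk$label)
feature_counts <- lapply(chunk[setdiff(cols, "label")],
                         function(f) table(f, chunk$label))

repeat {
  chunk <- tryCatch(
    read.csv(con, nrows = chunk_size, header = FALSE,
             col.names = cols, colClasses = "character"),
    error = function(e) NULL
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  # Assumes later chunks contain no unseen classes or feature levels
  class_counts <- class_counts +
    table(factor(chunk$label, levels = names(class_counts)))
  for (nm in names(feature_counts)) {
    feature_counts[[nm]] <- feature_counts[[nm]] +
      table(factor(chunk[[nm]], levels = rownames(feature_counts[[nm]])),
            factor(chunk$label, levels = colnames(feature_counts[[nm]])))
  }
}
close(con)

# Class priors and conditional probabilities follow directly from the counts
class_counts / sum(class_counts)
```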

However, many algorithms, and especially their implementations, really do require the whole dataset. If the size of the dataset fits your disk (and file system limitations), you can use the mmap package, which allows you to map a file on disk to memory and use it in your program. Note, however, that reads and writes to disk are expensive, and R sometimes likes to move data back and forth frequently, so be careful.
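
A minimal sketch, assuming the mmap package's mmap()/munmap() interface and a hypothetical binary file of raw 8-byte doubles:

```r
# Sketch: memory-map a large binary file of doubles with the mmap package.
# "signal.bin" is a hypothetical file of raw 8-byte (real64) values.
library(mmap)

m <- mmap("signal.bin", mode = real64())

head(m[1:1000])   # only the pages you actually touch are paged into RAM
mean(m[1:1e6])    # work on slices rather than the whole vector

munmap(m)         # release the mapping when done
```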

If your data can't be stored even on your hard drive, you will need to use distributed machine learning systems. One such R-based system is Revolution R, which is designed to handle really large datasets. Unfortunately, it is not open source and costs quite a lot of money, but you may be able to get a free academic license. As an alternative, you may be interested in the Java-based Apache Mahout - a less elegant but very efficient solution based on Hadoop, which includes many important algorithms.

ffriend
    With Revolution R you can apply some functions to large datasets, but only the functions implemented in the RevoScaleR package. You don't have a generic way to use any R function or package on large datasets. For example, if you want to run DBSCAN clustering you would need to rewrite the whole method with the basic functions offered by RevoScaleR (or similar packages). – skan Mar 14 '15 at 21:16
3

If memory is not sufficient, one solution is to push the data to disk and use distributed computing. I think RHadoop (R + Hadoop) may be one solution for tackling a large dataset.
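
A toy sketch of that workflow, assuming the rmr2 package (to.dfs, mapreduce, keyval); the "local" backend shown here only emulates Hadoop for testing, and on a real cluster you would drop that line.

```r
# Sketch of the RHadoop/rmr2 workflow: group means as a MapReduce job.
# The toy (group, value) data and the "local" backend are for illustration.
library(rmr2)
rmr.options(backend = "local")  # emulate Hadoop locally; omit on a real cluster

# Push some toy key/value pairs into the (local) DFS
input <- to.dfs(keyval(sample(1:10, 1000, replace = TRUE), rnorm(1000)))

# Map passes the pairs through; reduce averages the values per key
out <- mapreduce(
  input  = input,
  map    = function(k, v) keyval(k, v),
  reduce = function(k, vv) keyval(k, mean(vv))
)

from.dfs(out)
```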

yanbohappy