
I need to use the aggregate function on an 18 GB dataset of numerical and categorical data in CSV format (more than 60 million records in some cases).

I have tried various packages like ff and bigmemory, but with no success. The problem is that I have to group the data by the values of some columns, applying a given user-defined function to one column (as aggregate does) or to several columns (as split does).

A short example of this:

country day month year f    person_id age ...
1       23  01    2014 4005 5000      20  ...
1       23  01    2014 4005 244       43  ...
...

Grouping by country and month, we want to know the number of passengers, as aggregate does on a data.frame or data.table (neither of which supports datasets this large); or, grouping by age and sex, we want to apply an analysis over country, day, and month, as split can do on a data.frame or data.table (again, not for large datasets).
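For concreteness, here is a minimal sketch of the first aggregation with data.table, assuming the data fit in memory; the file name is hypothetical, and counting rows per group (`.N`) assumes each record is one passenger, based on the example above:

```r
library(data.table)

# fread() reads CSV quickly, but the whole file must still fit in RAM
dt <- fread("passengers.csv")  # hypothetical file name

# Passengers per country and month; .N is the row count per group,
# analogous to aggregate() with a counting function on a data.frame
result <- dt[, .(passengers = .N), by = .(country, month)]
```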

Can you folks let me know a solution to this? Any hints would be helpful. Thanks a lot for your collaboration!

  • You should post the output of dput(head(df)) here and mention one specific problem in one post. It is best to give an example of the output that you need. – rnso Aug 31 '14 at 16:09
  • If you have enough RAM on your machine, then the package `data.table` should be able to cope with data this size. You may also be able to split the data across several computing machines, e.g. by using Hadoop or use [commercial solutions](http://www.revolutionanalytics.com/whitepaper/revolution-r-enterprise-scaler-fast-highly-scalable-r-multiple-processors). – Andrie Aug 31 '14 at 16:52
  • Are you working in a Linux environment or a Windows environment? – Mike.Gahan Aug 31 '14 at 17:41
  • If you're RAM-strapped & you can spare some $ on Amazon (and you are allowed to put the data up on Amazon temporarily), you can fire up an instance with sufficient RAM to handle that in-memory with `data.table`. Another option (if you can't do that) is to set up a single-instance hadoop environment where the R code can execute ([this](http://rdatamining.wordpress.com/2014/05/30/step-by-step-guide-to-setting-up-an-r-hadoop-system/) might help kick-start such an environment), since R would be running in-hadoop rather than handling the data set on its own. – hrbrmstr Aug 31 '14 at 17:54
  • Alternatively, stick the data in SQL (MySQL or SQLite) and use `dplyr` and `tidyr`, which *should* perform most of the grouping in-database and let you run the functions you need on the groups (which may work with the RAM you have); see the first sketch after these comments. – hrbrmstr Aug 31 '14 at 17:55
  • If you are using ff, you should use ffdfdply from the package ffbase, as shown here: http://stackoverflow.com/questions/20951433/aggregation-using-ffdfdply-function-in-r/20954315#20954315 . It allows you to do a split, and within each split your data is in RAM, where you can use standard functions like grouping with data.table; see the second sketch after these comments. –  Sep 01 '14 at 08:14
  • Thank you all for your answers. I currently work in a Windows environment and have almost reached my objective using bigmemory, but I still need to read my table from several CSV files. – Carlos M. Fernandez Sep 08 '14 at 11:45
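Following the in-database suggestion above, a minimal sketch with dplyr over SQLite; the database file, table name, and columns are assumptions, and dplyr needs its database backend (dbplyr) installed so the grouping is translated to SQL and runs in the database rather than in R's memory:

```r
library(DBI)
library(dplyr)  # the dbplyr backend must also be installed

# Connect to a SQLite file that already contains the CSV data
con <- dbConnect(RSQLite::SQLite(), "passengers.sqlite")

passengers <- tbl(con, "passengers")  # a lazy reference; nothing is loaded yet

# group_by/summarise are translated to SQL and executed in-database;
# collect() pulls only the small aggregated result into R's memory
result <- passengers %>%
  group_by(country, month) %>%
  summarise(passengers = n()) %>%
  collect()

dbDisconnect(con)
```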
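And a sketch of the ffdfdply route from the last ff comment: the full table stays on disk, and rows for one batch of split values at a time are brought into RAM, where ordinary data.frame code can run (file and column names are again assumptions based on the example in the question):

```r
library(ff)
library(ffbase)

# read.csv.ffdf stores the table on disk in ff format
ffd <- read.csv.ffdf(file = "passengers.csv")

# ffdfdply loads the rows for each batch of split values into RAM,
# applies FUN (which must return a data.frame), and binds the results
result <- ffdfdply(
  x = ffd,
  split = as.character(ffd$country[]),  # split key, materialized in RAM
  FUN = function(d) {
    # several split values may arrive together, so group inside FUN too;
    # one row per person is assumed, so length() counts passengers
    aggregate(person_id ~ country + month, data = d, FUN = length)
  }
)
```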

0 Answers