
I would like to summarize a CSV file that has too many rows to fit into memory. Here's what I would like to do:

library(plyr)
dat = read.csv("./myfile.csv", stringsAsFactors = FALSE, header = TRUE)
dat2 = ddply(dat, ~colA + colB, summarise,
             mean = mean(colC),
             se = sd(colC) / sqrt(length(colC)))

I can change the code to read the file line by line using readLines, but it's no longer clear how to use ddply in that scenario.

fairidox
  • How many rows are there? You might want to split the file into chunks and use ddply on each. (or for speed, use dplyr or data.table.) – Frank Nov 10 '14 at 23:11
  • @Frank But would the values be correct? The means would be, but the standard errors would not, right? – fairidox Nov 10 '14 at 23:25
  • If the chunks were the same size, the means would be correct but the standard errors would not. If the chunks were different sizes, both would be incorrect. But you can get around this manually by keeping track of `sum(colC)` and `sum(colC*colC)` and calculating the means and standard errors at the end (see the sketch just below these comments). – Frank Nov 10 '14 at 23:38
  • If you want speed and memory efficiency, don't use `plyr` and `ddply`. It isn't built for this, `dplyr` and `data.table` are. – mnel Nov 11 '14 at 00:01
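
To make Frank's chunk-and-accumulate suggestion concrete, here is a minimal sketch that keeps per-group counts, sums and sums of squares, then derives the means and standard errors at the end. It sticks with plyr for familiarity; the readLines-based reader and the chunk size are assumptions, and only colA/colB/colC come from the question.

library(plyr)
con <- file("./myfile.csv", open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]   # column names from the first line
acc <- NULL
repeat {
  lines <- readLines(con, n = 100000)                 # read ~100k rows at a time
  if (length(lines) == 0) break
  chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                    stringsAsFactors = FALSE)
  # per-chunk group counts, sums and sums of squares
  sums <- ddply(chunk, ~colA + colB, summarise,
                n = length(colC), s = sum(colC), ss = sum(colC^2))
  acc <- rbind(acc, sums)
}
close(con)
# combine the per-chunk pieces, then derive the mean and se per group
totals <- ddply(acc, ~colA + colB, summarise,
                n = sum(n), s = sum(s), ss = sum(ss))
dat2 <- transform(totals,
                  mean = s / n,
                  se = sqrt((ss - s^2 / n) / (n - 1)) / sqrt(n))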

1 Answer


Not with ddply.

There are a number of options.

  1. Put the data in a database then work with the database (RODBC / sqldf / dplyr)
  2. Use a more memory efficient representation in R (data.table)

Database approach.

Using sqldf to create the database:

see https://stackoverflow.com/a/4335739/1385941

library(sqldf)
# create an on-disk SQLite database (so the data never has to fit in RAM)
sqldf("attach my_db as new")
# read the data from the csv directly into a table in that database
read.csv.sql("./myfile.csv", sql = "create table main.mycsv as select * from file",
             dbname = "my_db")
# perform the aggregation in SQL (note the group by)
dat2 <- sqldf("select ColA, ColB, avg(ColC) as mean,
                      stdev(ColC) / sqrt(count(*)) as se
               from main.mycsv
               group by ColA, ColB",
              dbname = "my_db")
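
As a quick sanity check (assuming the my_db database and mycsv table created above), you can pull a handful of rows back into R:

# peek at the first few rows of the table that was just created
sqldf("select * from main.mycsv limit 5", dbname = "my_db")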

Using dplyr (a complete rewrite of the ddply-like facilities of plyr)

See the vignette

library(dplyr)
library(RSQLite)
# reference database (created in previous example)
my_db <- src_sqlite('my_db')
# reference the table created from myfile.csv
dat <- tbl(my_db, "mycsv")

dat2 <- dat %>%
          group_by(ColA, ColB) %>% 
          summarize(mean = mean(ColC), se = sd(ColC) / sqrt(n()))
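
Note that with a database backend the pipeline above builds a lazy query. Assuming the summarized result (one row per ColA/ColB combination) fits in memory, collect() executes the query in SQLite and returns an ordinary data frame:

# run the query and bring the summarized result into R
dat2_local <- collect(dat2)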

Using data.table

library(data.table)
# fread is a fast way to read in files!
dat <- fread('./myfile.csv')
dat2 <- dat[, list(mean = mean(colC), se = sd(colC) / sqrt(.N)),
            by = list(colA, colB)]
mnel
  • Thanks, great answer; I found data.table to be very fast. However, perhaps I accepted this a bit too soon, as all of these procedures share one major drawback: the first step is to load the entire file into memory, which is just not a scalable approach and can't work for very large files. – fairidox Nov 11 '14 at 04:19