Not with ddply.
There are a number of options.
- Put the data in a database, then work with the database (RODBC / sqldf / dplyr).
- Use a more memory-efficient representation in R (data.table).
Database approach, using sqldf to create the database
(see https://stackoverflow.com/a/4335739/1385941).
library(sqldf)
# create a persistent database file on disk
sqldf("attach 'my_db' as new")
# read the data from the csv directly into the database
read.csv.sql("./myfile.csv",
             sql = "create table main.mycsv as select * from file",
             dbname = "my_db")
# perform the aggregation in SQL; SQLite uses avg(), not mean(),
# and a group by clause is needed to aggregate per group
dat2 <- sqldf("select ColA, ColB, avg(ColC) as mean,
                      stdev(ColC) / sqrt(count(*)) as se
               from main.mycsv group by ColA, ColB",
              dbname = "my_db")
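A quick sanity check against the loaded table can save debugging later; this sketch assumes the table and database names from the snippet above.

library(sqldf)
# confirm the table loaded: row count and a peek at the first rows
sqldf("select count(*) as n from main.mycsv", dbname = "my_db")
sqldf("select * from main.mycsv limit 5", dbname = "my_db")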
Using dplyr (a complete rewrite of the ddply-like facilities of plyr).
See the vignette.
library(dplyr)
library(RSQLite)
# connect to the database created in the previous example
my_db <- src_sqlite("my_db")
# reference the table created from myfile.csv
dat <- tbl(my_db, "mycsv")
dat2 <- dat %>%
  group_by(ColA, ColB) %>%
  summarize(mean = mean(ColC), se = sd(ColC) / sqrt(n()))
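Note that dplyr translates this pipeline to SQL and runs it lazily against the database; collect() forces the query and brings the result into R as an ordinary data frame.

# force the query to run and pull the summarised rows into memory
dat2_local <- collect(dat2)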
Using data.table
library(data.table)
# fread is a fast way to read in files!
dat <- fread("./myfile.csv")
dat2 <- dat[, list(mean = mean(ColC), se = sd(ColC) / sqrt(.N)),
            by = list(ColA, ColB)]
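If you will group by the same columns repeatedly, keying the table can help; a minimal sketch, assuming the same dat and column names as above.

# setkey sorts the table by these columns once, so grouped
# operations on them can reuse that ordering
setkey(dat, ColA, ColB)
dat2 <- dat[, list(mean = mean(ColC), se = sd(ColC) / sqrt(.N)),
            by = list(ColA, ColB)]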