
I'm new to R and the ff package, and am trying to better understand how ff allows users to work with large datasets (> 4 GB). I have spent a considerable amount of time trawling the web for tutorials, but the ones I could find generally go over my head.

I learn best by doing, so as an exercise, I would like to know how to create a long-format time-series dataset, similar to R's in-built "Indometh" dataset, using arbitrary values. Then I would like to reshape it into wide format. Then I would like to save the output as a csv file.

With small datasets this is simple, and can be achieved using the following script:

##########################################
#Generate the data frame

DF<-data.frame()
for(Subject in 1:6){
  for(time in 1:11){
    DF<-rbind(DF,c(Subject,time,runif(1)))
  }
}
names(DF)<-c("Subject","time","conc")

##########################################
#Reshape to wide format

DF<-reshape(DF, v.names = "conc", idvar = "Subject", timevar = "time", direction = "wide")

##########################################
#Save csv file

write.csv(DF,file="DF.csv")

But I would like to learn how to do this for file sizes of approximately 10 GB. How would I do this using the ff package? Thanks in advance.

Luke23

2 Answers


The function reshape does not exist explicitly for ffdf objects, but the same result is straightforward to achieve with functionality from package ffbase: use ffdfdply from ffbase, split by Subject, and apply reshape inside the function.

An example based on the Indometh dataset, scaled up to 1,000,000 subjects:

require(ffbase)
require(datasets)
data(Indometh)

## Generate some random data
x <- expand.ffgrid(Subject = ff(factor(1:1000000)), time = ff(unique(Indometh$time)))
x$conc <- ffrandom(n=nrow(x), rfun = rnorm)
dim(x)
[1] 11000000        3

## and reshape to wide format
result <- ffdfdply(x=x, split=x$Subject, FUN=function(datawithseveralsplitelements){
  df <- reshape(datawithseveralsplitelements, 
              v.names = "conc", idvar = "Subject", timevar = "time", direction = "wide")
  as.data.frame(df)
})
class(result)
[1] "ffdf"
colnames(result)
[1] "Subject"   "conc.0.25" "conc.0.5"  "conc.0.75" "conc.1"    "conc.1.25" "conc.2"    "conc.3"    "conc.4"    "conc.5"    "conc.6"    "conc.8"   
dim(result)
[1] 1000000      12
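To cover the last step of the question (saving to csv) without pulling the whole table into RAM, the wide ffdf can be streamed to disk with write.csv.ffdf from package ff. A small sketch (the toy data frame and the file name "DF_wide.csv" are just illustrative stand-ins for `result` above):

```r
require(ff)

## Toy stand-in for the wide ffdf produced above (replace with `result`).
wide <- as.ffdf(data.frame(Subject = 1:3,
                           conc.1  = runif(3),
                           conc.2  = runif(3)))

## write.csv.ffdf streams the ffdf to disk chunk by chunk instead of
## materialising the whole table in memory at once.
write.csv.ffdf(wide, file = "DF_wide.csv")
```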

You would be hard-pressed to construct a less efficient method than the one you offer: rbind-ing to a data.frame one row at a time copies the whole object on every iteration. Try this instead to create a six-thousand-row dataset for 6 subjects:

DF <- data.frame( Subj = rep( 1:6, each=1000), matrix(runif(6000*11), nrow=6000) )

Scaling that up to a billion items (US billion, not UK billion) should give you roughly a 10 GB object, so maybe try 80 million rows or so?
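If the target is too big to generate in one shot, one hedged sketch is to build it chunk by chunk and append with ffdfappend from ffbase, so only one chunk sits in RAM at a time. The sizes below are deliberately tiny for illustration; you would scale n_subj toward the 80-million-row figure mentioned above:

```r
require(ffbase)

n_subj <- 1000   # illustrative; scale toward ~80 million for a ~10 GB file
chunk  <- 250    # subjects generated per iteration

big <- NULL
for (start in seq(1, n_subj, by = chunk)) {
  ids <- start:min(start + chunk - 1, n_subj)
  ## One wide chunk: one row per subject, 11 columns of random values.
  df <- data.frame(Subj = ids,
                   matrix(runif(length(ids) * 11), nrow = length(ids)))
  ## First chunk seeds the ffdf; later chunks are appended on disk.
  big <- if (is.null(big)) as.ffdf(df) else ffdfappend(big, df)
}
dim(big)   # 1000 rows, 12 columns
```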

I think asking for a tutorial on the ff package is out of scope for SO. Please read the FAQ. Such questions are generally closed because the questioner demonstrates that they don't really know what they are talking about.

IRTFM
  • Second, you have obviously not read my question. I did not ask for a tutorial on the ff package; I asked how to do a very specific task. Your answer shows that you do not understand how to do that task. – Luke23 Feb 03 '14 at 00:05
  • Well, I did read the question as can be seen by my efforts at coding. But I admit to not knowing about the incredibly compact solution that jwiffels provided with `ffdfdply`. So I guess I'll just upvote you both. – IRTFM Feb 05 '14 at 01:13