
I've spent hours reading about the ff package and still can't get a handle on this topic. Basically, I'd like to run an analysis on a large data set and save the results/statistics from that analysis.

I adapted the biglm example from the ff package documentation to my data set: http://cran.r-project.org/web/packages/ff/ff.pdf The problem is very similar to this one: Modeling a very big data set (1.8 Million rows x 270 Columns) in R

Here is my code:

library(ff)
library(ffbase)
library(doSNOW)


registerDoSNOW(makeCluster(4, type = "SOCK"))
memory.limit(size=32000)

setwd('Z:/data')
wd <- getwd()
data.path <- file.path(wd,'ffdb')
data.path.train <- file.path(data.path,'train')

ff.train <- read.table.ffdf(file='train.tsv', sep='\t')

save.ffdf(ff.train, dir=data.path.train)


library(biglm)

# Here I'm implementing biglm model on ffdf data
# Vi represents the column names

form <- V27 ~ V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 + V12 + V13 + V14 + V15
# for() returns NULL, so the loop result is not assigned; the fit is kept in biglmfit
for (i in chunk(ff.train, by=500)){
  if (i[1]==1){
    message("first chunk is: ", i[[1]],":",i[[2]])
    biglmfit <- biglm(form, data=ff.train[i,,drop=FALSE])
  }else{
    message("next chunk is: ", i[[1]],":",i[[2]])
    biglmfit <- update(biglmfit, ff.train[i,,drop=FALSE])
  }
}

When the above code is run, it gives the following error message:

first chunk is: 1:494
Error: cannot allocate vector of size 19.4 Gb
In addition: There were 50 or more warnings (use warnings() to see the first 50)

Does this error mean that biglmfit is too large to fit in memory? Is there a workaround to save biglmfit as an ffdf data type? Or, for that matter, is there any way to store analysis statistics in an ffdf, chunk by chunk? Thank you.

EDIT:

vmode(ff.train)
       V1        V2        V3        V4        V5        V6        V7        V8        V9       V10 
"integer" "integer" "integer" "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
      V11       V12       V13       V14       V15       V16       V17       V18       V19       V20 
"integer" "integer" "integer" "integer" "integer" "integer" "integer" "integer" "integer" "integer" 
      V21       V22       V23       V24       V25       V26       V27 
"integer" "integer" "integer" "integer" "integer" "integer" "integer" 
ElegantFellow
  • You need to provide more details on your V27 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10 + V11 + V12 + V13 + V14 + V15 columns. bigglm will at some point call model.matrix, which expands your factors into numeric matrices. How many factors do you have in these columns, and how many levels does each factor have? –  Sep 09 '13 at 20:24
  • @Jwijffels I ran the vmode(ff.train) command and the data types are shown in the EDIT above – ElegantFellow Sep 10 '13 at 16:30
  • Showing vmode is not relevant here; the important part of the question was: **how many levels does each factor have?** –  Sep 10 '13 at 16:53
  • Do you have a command to show the data type of each column in an ff object? str(ff.train) shows a big list. – ElegantFellow Sep 10 '13 at 18:57
  • lapply(physical(ff.train), levels) for showing the levels of each column –  Sep 10 '13 at 19:06
  • OK, it generates too big a list to post here. I admit that my formula (V27 ~ V1 + V2 + ..., etc.) is just wrong. I'd better just give you the data source I'm tinkering with: http://www.kaggle.com/c/stumbleupon/data I'm not sure how biglm generates matrices. Could you elaborate on that? – ElegantFellow Sep 10 '13 at 19:39
  • Have a look at ?model.matrix and look at the examples of that function. –  Sep 10 '13 at 19:49
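
To make the model.matrix point from the comments concrete, here is a minimal sketch in base R on made-up data. The names toy, y, x and f are hypothetical and have nothing to do with the question's data set; the sketch only assumes that some predictor is a factor with many levels. It shows how one such factor expands into one numeric column per level, and how to count levels per column, which is the in-memory analogue of lapply(physical(ff.train), levels).

```r
# Toy data (hypothetical): one numeric predictor and one factor with 200 levels.
n <- 1000
toy <- data.frame(
  y = seq_len(n) / n,
  x = rev(seq_len(n)) / n,
  f = factor(rep(sprintf("lvl%03d", 1:200), length.out = n))
)

# Count the levels of each column; non-factor columns report 0.
n_levels <- sapply(toy, nlevels)
print(n_levels)

# model.matrix expands the factor into one 0/1 column per level
# (minus the reference level), next to the intercept and x.
mm <- model.matrix(y ~ x + f, data = toy)
print(dim(mm))

# Rough size of a dense double matrix: rows * columns * 8 bytes.
est_bytes <- nrow(mm) * ncol(mm) * 8
print(est_bytes)
```

If a column that was meant to be numeric or free text ends up as a factor with thousands of levels, the same expansion happens for every chunk passed to the model, so checking the level counts before writing the formula is a cheap first step.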

0 Answers