
I'm trying to work with a 1909x139352 dataset in R. Since my computer only has 2GB of RAM, the dataset (about 500MB) is too big for conventional methods, so I decided to use the ff package. However, I've been having some trouble: read.table.ffdf fails on the first chunk of data with the following error:

txtdata <- read.table.ffdf(file = "/directory/myfile.csv",
                           FUN = "read.table",
                           header = FALSE,
                           sep = ",",
                           colClasses = c("factor", rep("integer", 139351)),
                           first.rows = 100, next.rows = 100,
                           VERBOSE = TRUE)

  read.table.ffdf 1..100 (100)  csv-read=77.253sec
  Error in ff(initdata = initdata, length = length, levels = levels, ordered = ordered,  : 
   write error

Does anyone have any idea what is going on?

3 Answers


This error message indicates that you have too many open files. In ff, every column of an ffdf is stored in its own file, and you can only have a limited number of files open at once - you have hit that limit. See my reply on Any ideas on how to debug this FF error?.

So in your case, simply using read.table.ffdf won't work because you have 139352 columns. It is still possible to import the data into ff, but you need to be careful about how many columns are open at once while getting data into RAM, to avoid this issue.
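If it helps, here is a minimal sketch (not the "[.ffdf" overloading described in the comments below) of one way to stay under the limit: read the CSV in blocks of, say, 1000 columns (read.table skips any column whose colClasses entry is "NULL"), convert each block into its own small ffdf, and close its files before moving on. The path, dimensions, and column types come from the question; everything else is an assumption, and re-reading the file once per block is slow but keeps both RAM use and the number of open files bounded.

    library(ff)

    ncols      <- 139352                  # total columns in the question's file
    block_size <- 1000                    # stay well under the open-file limit
    blocks     <- split(seq_len(ncols), ceiling(seq_len(ncols) / block_size))

    ffdf_blocks <- lapply(blocks, function(cols) {
      cc       <- rep("NULL", ncols)      # "NULL" tells read.table to skip a column
      cc[cols] <- "integer"
      if (1 %in% cols) cc[1] <- "factor"  # first column is a factor in the question
      block <- read.table("/directory/myfile.csv", header = FALSE,
                          sep = ",", colClasses = cc)
      fb <- as.ffdf(block)                # one file per column, but only block_size of them
      close(fb)                           # release the file handles before the next block
      fb                                  # reopen later with open(fb) when needed
    })

The result is a list of narrow ffdf objects rather than one 139352-column ffdf, which sidesteps the limit entirely; stitching them back into a single wide object would reintroduce it unless access is done in column groups, as described in the comments below.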

  • Thanks! So what should I do in order to be able to import it in ff? – Jairos Arredondo Dec 26 '12 at 16:40
  • You need to overload the functions "[<-.ffdf" and "[.ffdf" from package ff such that they work on groups of, say, 1000 columns instead of all columns at once. So you open 1000 columns, use "[<-.ffdf", close those 1000 columns, and move on to the next 1000 until you have covered all 139352 columns. Or you can ask the package author of ff to incorporate that in his package. It is quite trivial I believe, and it's a feature I would also like to have :) but I believe the best option is to change it in the ff package itself. –  Dec 28 '12 at 08:49
  • @jwijffels, I've posted an alternate answer that enables one to use ffdf without the grouping you describe. I thought you might find it helpful - YMMV. I would really appreciate an example of the overloading you mention, as that would be helpful when one doesn't have access to change the max open files settings. – Chris Townsend Dec 02 '16 at 16:34

Your data set really isn't that big. It might help if you said something about what you're trying to do with it. This might help: Increasing Available memory in R. If that doesn't work, the data.table package is VERY fast and doesn't hog memory when manipulating data.tables with the := operator.

As far as read.table.ffdf goes, check out this read.table.ffdf tutorial; if you read it carefully, it gives hints and details about optimizing your memory usage with commands like gc() and more.
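For what it's worth, here is a minimal sketch of the data.table route, assuming a 64-bit R session with enough address space for fread; the file path and the column operation are made up for illustration:

    library(data.table)

    ## fread() parses a CSV far faster than read.table() and with less copying
    dt <- fread("/directory/myfile.csv", header = FALSE, sep = ",")

    ## := modifies a column by reference, without copying the 139352-column table
    dt[, V2 := V2 + 1L]      # V2 is one of the default V1, V2, ... column names

    gc()   # as the tutorial suggests, prompt R to return freed memory

As Matt Dowle notes in the comments below, fread has loaded files larger than 500MB on a 2GB machine, but only with 64-bit R.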

N8TRO
  • and data.table has fread, new in 1.8.7, but this is very new. I've loaded files larger than 500MB on my 2GB RAM netbook, but it is 64-bit; fread needs that addressability. Worth a quick attempt @jairos. – Matt Dowle Dec 25 '12 at 11:04
  • Thank you guys! I decided to use ff because previous attempts to do it with read.table were unsuccessful. My computer would get really slow due to the overuse of RAM. Maybe the file isn't that big, but the dataset has almost 140,000 columns, which I think is the real problem. I'll give the data.table function a try and let you know. – Jairos Arredondo Dec 26 '12 at 16:49

I recently encountered this problem with a data frame that had ~3,000 columns. The easiest way to get around it is to raise the maximum number of open files allowed for your user account. A typical system is set to ~1024, which is a very conservative limit. Do note that it is set to prevent resource exhaustion on the server.

On Linux:

Add the following to your /etc/security/limits.conf file:

    youruserid hard nofile 200000   # you may enter whatever number you wish here
    youruserid soft nofile 200000   # whatever you want the default to be for each shell or process you have running

On OS X:

Add or edit the following in your /etc/sysctl.conf file:

    kern.maxfilesperproc=200000
    kern.maxfiles=200000

You'll need to log out and log back in, but after that the original poster should be able to use ffdf to open his 139352-column data frame.
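If you want to confirm that the new limit took effect before retrying the import, one quick check from inside R (assuming a Unix-like shell is available) is:

    ## should print the raised per-process limit, e.g. "200000"
    system("ulimit -n", intern = TRUE)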

I've posted more about my run-in with this limit here.

Chris Townsend