
I have a dataframe composed of 25 columns and ~1M rows, split into 12 files, and I need to import them and then use some reshape package to do some data management. Each file is so large that I have to look for some "non-RAM" solution for importing and data processing. Currently I don't need to do any regression; I will only compute some descriptive statistics on the dataframe.

I searched a bit and found two packages: ff and filehash. I read the filehash manual first and it seems simple: just add some code to import the dataframe into a file, and the rest seems similar to usual R operations.
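For reference, here is a minimal sketch of the filehash workflow I have in mind (the database name `"mydb"` and the key `"df"` are made up for illustration; the CSV file names are placeholders for my 12 files):

```r
library(filehash)

# Create a file-backed database once, then open a handle to it
dbCreate("mydb")
db <- dbInit("mydb")

# Import each file and store it under a key; the data lives on disk,
# not in RAM, once the local variable is removed
part1 <- read.csv("part01.csv")
dbInsert(db, "df", part1)
rm(part1)

# Later, fetch the dataframe back from the file when needed
d <- dbFetch(db, "df")
```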

I haven't tried ff yet, as it comes with lots of different classes, and I wonder if it is worth investing time in understanding ff itself before my real work begins. The filehash package, however, seems to have been static for some time and there is little discussion about it, so I wonder whether filehash has become less popular, or even obsolete.

Can anyone help me choose which package to use? Or can anyone tell me the differences / pros and cons between them? Thanks.

Update 01

I am currently using filehash to import the dataframe, and have realized that a dataframe imported using filehash should be considered read-only: any further modification of that dataframe will not be stored back to the file unless you save it again. That is not very convenient in my view, as I need to remind myself to do the saving. Any comment on this?
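Concretely, the pattern I have ended up with is to re-insert the object under the same key after every modification (key and column names below are made up for illustration):

```r
library(filehash)
db <- dbInit("mydb")

# Fetch, modify in memory, then explicitly write back:
# changes to `d` alone do NOT touch the file
d <- dbFetch(db, "df")
d$new_col <- d$old_col * 2
dbInsert(db, "df", d)   # without this line the change is lost
```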

    You should also look at `bigmemory`. See http://stackoverflow.com/a/9432009/602276 – Andrie Mar 29 '12 at 07:29
  • What about a database solution (sqldf, MySQL...)? – Roman Luštrik Mar 29 '12 at 07:31
  • @Roman but I need to do some manipulation of the data like `melt` and `cast`, can these be done using `sqldf` or `RSQLite`? – lokheart Mar 29 '12 at 07:58
  • 2
    how much RAM do you have on your system? 25 x 1m doesn't seem that big. – JD Long Mar 29 '12 at 13:13
  • 2
    First of all, melt/cast mechanism is a giant memory waster and thus is unsuitable for even medium data. – mbq Mar 29 '12 at 17:53
  • @mbq yes, I realized that reshape package can't even handle 1/12 of my entire dataset, perhaps I should dump that and do the melt/cast manually – lokheart Mar 29 '12 at 23:31
  • You might try `pandas` in python. From what I can tell it is more efficient with memory usage http://pandas.pydata.org/ – Zach Jul 04 '12 at 20:05
