
I have a large dataset (about 13GB uncompressed) and I need to load it repeatedly. The first load (and save to a different format) can be very slow but every load after this should be as fast as possible. What is the fastest way and fastest format from which to load a data set?

My suspicion is that the optimal choice is something like

 saveRDS(obj, file = 'bigdata.Rda', compress = FALSE)
 obj <- readRDS('bigdata.Rda')

But this seems to be slower than using the fread function in the data.table package. This should not be the case, because fread has to parse a CSV text file (although it is admittedly highly optimized).

Some timings for a ~800MB dataset are:

> system.time(tmp <- fread("data.csv"))
Read 6135344 rows and 22 (of 22) columns from 0.795 GB file in 00:00:43
     user  system elapsed 
     36.94    0.44   42.71 
> saveRDS(tmp, file = 'tmp.Rda')
> system.time(tmp <- readRDS('tmp.Rda'))
     user  system elapsed 
     69.96    2.02   84.04
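
For reference, a sketch of how this comparison can be reproduced end to end (assuming the same data.csv; note the saveRDS call above used the default compression, while compress = FALSE is the variant I suspect should load fastest):

    library(data.table)

    # Slow first load: parse the CSV once with data.table's optimized reader
    system.time(tmp <- fread("data.csv"))

    # One-time conversion to R's native serialization format, uncompressed
    saveRDS(tmp, file = "tmp_raw.Rda", compress = FALSE)

    # Subsequent loads: read straight from the binary file
    system.time(tmp2 <- readRDS("tmp_raw.Rda"))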

Previous Questions

This question is related but does not reflect the current state of R; for example, an answer suggests reading from a binary format will always be faster than a text format. The suggestion to use a *SQL database is also not helpful in my case, as the entire data set is required, not just a subset of it.

There are also related questions about the fastest way of loading data once (e.g. 1).

orizon
  • How do you expect people to answer without supplying any data? I don't really see the point of this question, as it seems you've already discovered the fast way(s). – Rich Scriven Aug 07 '15 at 22:04
  • I am one of the down voters, and I'm very close to closing it as a duplicate, as I am really struggling to see what information will come from this question that hasn't been thoroughly covered elsewhere. – joran Aug 07 '15 at 22:18
  • Alright, I pulled the trigger. If you disagree plead your case in the comments and perhaps I will be overruled. – joran Aug 07 '15 at 22:21
  • @joran I can see no other answer that explains why fread is faster than readRDS. This seems inconsistent with the belief that loading binary formats is always faster than loading text formats. Would it be helpful if I ask only that question? – orizon Aug 07 '15 at 22:21
  • @joran I don't currently have the inclination or enthusiasm to contest this, but the choice of duplicate seems poor. The duplicate question chosen relates only to reading text data in tabular format. The [similar question I linked in my question](http://stackoverflow.com/questions/4756989/how-to-load-data-quickly-into-r) is much closer to the question I asked. – orizon Aug 07 '15 at 22:29
  • @orizon I don't see a single "why" in your question. If you're interested in why one method is faster than another, I think that's a completely different question. Though I'm not sure that would make a good question for SO, as the complete answer could probably be a long paper. – Gregor Thomas Aug 07 '15 at 22:29
  • @Gregor You are right that this is not really the question I had asked but it is a compromise question I was interested in. I do still feel that my question is unanswered by the purported duplicates, but at this point the barrier to getting an answer on stackoverflow is too high for it to be appealing to me and I will look elsewhere. – orizon Aug 07 '15 at 22:36
  • I posted a link in the R chat room inviting folks to disagree and overrule me, to make sure it gets plenty of eyeballs from R folks. – joran Aug 07 '15 at 23:37

1 Answer


It depends on what you plan to do with the data. If you want the entire data set in memory for some operation, then your best bet is probably fread or readRDS (the file size for data saved in RDS is much, much smaller, if that matters to you).

If you will be doing summary operations on the data, then I have found a one-time conversion to a database (using sqldf) to be a much better option, as subsequent operations are much faster when run as SQL queries on the data. That is also because I don't have enough RAM to load a 13 GB file into memory.
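
A sketch of that one-time conversion, using DBI/RSQLite directly rather than sqldf (sqldf drives a SQLite database under the hood; the table and column names here are illustrative, and the data.frame tmp is assumed to already be in memory):

    library(DBI)

    # One-time conversion: push the data into an on-disk SQLite database.
    # (For a file too big for RAM, this step would itself need to be done
    # in chunks, e.g. with fread's skip/nrows arguments.)
    con <- dbConnect(RSQLite::SQLite(), "bigdata.sqlite")
    dbWriteTable(con, "bigdata", tmp)

    # Later sessions: summaries run inside the database, not in R's memory
    res <- dbGetQuery(con, "SELECT colA, AVG(colB) FROM bigdata GROUP BY colA")
    dbDisconnect(con)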

Rohit Das
  • Part of the reason the file size using saveRDS is smaller is that compression is used by default, and this degrades load performance; although the more efficient representation of factors and characters probably also helps. – orizon Aug 07 '15 at 22:12
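
A minimal sketch of the trade-off described in the comment above, reusing the tmp object from the timings in the question (actual sizes and times depend on the data):

    # Default: compressed, smaller on disk, slower to write and read back
    system.time(saveRDS(tmp, "tmp_compressed.rds"))
    system.time(readRDS("tmp_compressed.rds"))

    # Uncompressed: larger on disk, faster round trip
    system.time(saveRDS(tmp, "tmp_raw.rds", compress = FALSE))
    system.time(readRDS("tmp_raw.rds"))

    # Compare the on-disk sizes
    file.size(c("tmp_compressed.rds", "tmp_raw.rds"))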