
I have some R scripts where I have to load several data frames into R as quickly as possible. This is quite important, as reading the data is the slowest part of the procedure, e.g. when plotting from different data frames. I get the data in sav (SPSS) format, but I could transform it to any suggested format. Merging the data frames is unfortunately not an option.

What could be the fastest way to load the data? I was thinking of the following:

  • transform from sav to a binary R object (RData) the first time, and later always load that, as it seems a lot quicker than read.spss (a rough sketch of this is below the list),
  • transform from sav to csv files and read the data from those with the parameters discussed in this topic,
  • or is it worth setting up a MySQL backend on localhost and loading the data from that? Could it be faster? If so, can I also save any custom attr values of the variables (e.g. variable.labels from SPSS-imported files), or should this be done in a separate table?
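
Just to illustrate what I mean by the first option, here is a rough sketch (the file names are only placeholders):

library(foreign)

## one-time conversion: parse the sav file and dump it as a binary R object,
## keeping the SPSS variable labels as well
data <- read.spss('mydata.sav', to.data.frame = TRUE)
variable.labels <- attr(data, 'variable.labels')
save(data, variable.labels, file = 'mydata.RData')

## all later sessions: load the binary object instead of parsing the sav file again
load('mydata.RData')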

Any other thoughts are welcome. Thank you for every suggestion in advance!


I made a little experiment below based on the answers you have given, and also added (24/01/2011) a quite "hackish" but really speedy solution that loads only a few variables/columns from a special binary file. The latter seems to be the fastest method I can imagine now, which is why I made (05/03/2011: ver. 0.3) a small package named saves to deal with this feature. The package is under "heavy" development; any recommendation is welcome!

I will soon post a vignette with accurate benchmark results with the help of the microbenchmark package.

– daroczig

4 Answers


Thank you all for the tips and answers; I did some summarizing and experimenting based on them.

See a little test below with a public database (ESS 2008 in Hungary). The database has 1508 cases and 508 variables, so it could be considered mid-sized data. That might be a good example to run the test on (for me), but of course special needs would require an experiment with adequate data.

Reading the data from the SPSS sav file without any modification:

> system.time(data <- read.spss('ESS_HUN_4.sav'))
   user  system elapsed 
  2.214   0.030   2.376 

Loading with a converted binary object:

> save('data',file='ESS_HUN_4.Rdata')
> system.time(data.Rdata <- load('ESS_HUN_4.Rdata'))
   user  system elapsed 
   0.28    0.00    0.28 

Trying with csv:

> write.table(data, file="ESS_HUN_4.csv")
> system.time(data.csv <- read.csv('ESS_HUN_4.csv'))
   user  system elapsed 
  1.730   0.010   1.824 

Trying with "fine-tuned" csv loading:

> system.time(data.csv <- read.table('ESS_HUN_4.csv', comment.char="", stringsAsFactors=FALSE, sep=","))
   user  system elapsed 
  1.296   0.014   1.362 

Also with the sqldf package, which seems to load csv files a lot faster:

> library(sqldf)
> f <- file("ESS_HUN_4.csv")
>  system.time(bigdf <- sqldf("select * from f", dbname = tempfile(), file.format = list(header = T, row.names = F, sep="\t")))
   user  system elapsed 
  0.939   0.106   1.071 

And also loading the data from a MySQL database running on localhost:

> library(RMySQL) 
> con <- dbConnect(MySQL(), user='root', dbname='test', host='localhost', password='')
> dbWriteTable(con, "data", as.data.frame(data), overwrite = TRUE)
> system.time(data <- dbReadTable(con, 'data'))
   user  system elapsed 
  0.583   0.026   1.055 
> query <-('SELECT * FROM data')
> system.time(data.sql <- dbGetQuery(con, query))
   user  system elapsed 
  0.270   0.020   0.473 

Here, I think we should add up the two system.time results reported, as connecting to the data also counts in our case. Please comment if I misunderstood something.
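
For example, something like this minimal sketch is what I have in mind (reusing the connection parameters above; the two proc_time objects can simply be added up):

## total time = (re)connecting to MySQL + reading the whole table
timing <- system.time(con <- dbConnect(MySQL(), user = 'root', dbname = 'test',
                                       host = 'localhost', password = '')) +
    system.time(data <- dbReadTable(con, 'data'))
timing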

But let us see what happens when querying only some variables: while plotting, for example, we do not need the whole data frame in most cases, and querying just two variables is enough to create a nice plot of them:

> query <-('SELECT c1, c19 FROM data')
> system.time(data.sql <- dbGetQuery(con, query))
   user  system elapsed 
  0.030   0.000   0.112 

Which seems really great! Of course, this is measured just after loading the table with dbReadTable.

Summary: nothing beats reading the whole data set from a binary file, but reading only a few columns (or other filtered data) from the same database table might also be worth considering in some special cases.

Test environment: HP 6715b laptop (AMD X2 2Ghz, 4 Gb DDR2) with a low-end SSD.


UPDATE (24/01/2011): I added a rather hackish, but quite "creative" way of loading only a few columns of a binary object - which looks a lot faster than any method examined above.

Be aware: the code will look really bad, but it is still very effective :)

First, I save all columns of a data.frame into different binary objects via the following loop:

## put the columns on the search path so that save() can find them by name
attach(data)
## write each column into its own binary file
for (i in seq_along(data)) {
    save(list = names(data)[i],
         file = paste('ESS_HUN_4-', names(data)[i], '.Rdata', sep = ''))
}
detach(data)

And then I load two columns of the data:

> system.time(load('ESS_HUN_4-c19.Rdata')) +
+     system.time(load('ESS_HUN_4-c1.Rdata')) +
+     system.time(data.c1_c19 <- cbind(c1, c19))
    user  system elapsed 
    0.003   0.000   0.002 

Which looks like a "superfast" method! :) Note: it loaded 100 times faster than the fastest method above (loading the whole binary object).

I have made a very tiny package (named: saves); look on GitHub for more details if interested.


UPDATE (06/03/2011): a new version of my little package (saves) was uploaded to CRAN, in which it is possible to save and load variables even faster - if the user needs only a subset of the available variables in a data frame or list. See the vignette in the package sources or the one on my homepage for details, and let me also introduce a nice boxplot of some benchmarks:

[Boxplot: comparison of different data frame/list loading mechanisms by speed]

This boxplot shows the benefit of using the saves package to load only a subset of variables, compared to load and read.table or read.csv from base R, read.spss from foreign, and the sqldf or RMySQL packages.
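
A minimal usage sketch of what the package is meant for (the interface might still change while the package is under development, and the argument names below are from memory, so please check the vignette for the actual arguments):

library(saves)

## save all variables of the data frame into a bundle of per-variable binary files
saves(data, file = 'ESS_HUN_4.RDatas')

## later: load only the two variables needed for a plot
data.subset <- loads(file = 'ESS_HUN_4.RDatas',
                     variables = c('c1', 'c19'), to.data.frame = TRUE)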

– daroczig
  • Thanks for running the tests and presenting the results. This is consistent with my preconceived notions, but as a rampant empiricist it is great to see tests. – JD Long Jan 21 '11 at 13:58
  • It's cool that you've disclosed the results. However, you shouldn't draw conclusions from one trial. I usually write a script, repeat it n times, go out for a walk, and crunch the data when I come back. =) Anyway... `read.csv` is noticeably slower, and you can make it perform faster if you provide the `colClasses` argument, otherwise the function will check the class of each column! And that is a drag. Besides, if you bypass the default behaviour of `stringsAsFactors`, your object will take up more memory. Anyway, thanks for the details! =) – aL3xa Jan 21 '11 at 14:32
  • @aL3xa: you are right, *one test is no test*, and the "experiment" should also have been run on multiple databases of different sizes. I might do this sometime :) I am quite sure that having run the "experiment" on an SSD drive might also have distorted the results (I think to the binary format's profit). And lastly: thank you for the extra fine-tuning of `read.csv`, I will include it in the next test (which I am planning to do in the near future). – daroczig Jan 21 '11 at 16:16
  • Hey, np. I'm glad that I helped. If I were you, I'd stick with the "database solution", not only because of the speed, but also for the freedom to pull down data in a more flexible way. It's a common approach, and you can access MySQL/PostgreSQL from various programming languages - like Python, PHP, or whatever... – aL3xa Jan 21 '11 at 16:44
  • Any chance you still have the same hardware and could revisit with data.table's fread? – russellpierce May 30 '15 at 11:06
  • For that matter, I see you now recommend Redis for data small enough to fit into RAM. That benchmark would be interesting too! – russellpierce May 30 '15 at 11:07
  • @rpierce I do not have that machine, but rerunning the tests shouldn't be hard -- everything is documented above. But to be honest there's not much use of this nowadays: as you also mentioned, `fread` is damn fast (although reading binary data is still faster), and we have good columnar DB storages out there -- no need to use the workaround described in the `saves` (experimental) package. – daroczig Jun 20 '15 at 07:41
  • IMO you sell saves short. The nice part about saves, though, is that if you have a distributed filesystem, there is no bottleneck in accessing the columnar store from a cluster. – russellpierce Jun 20 '15 at 13:52
  • would it be possible to add `data.table` benchmarks to this for `csv` files? – road_to_quantdom Mar 01 '22 at 17:48
  • I am a big fan of `data.table` myself, but this benchmark was run over a decade ago, so I'd just leave this as-is, as it's not much more now than an interesting thread from the past without much relevance in 2022. E.g. now we have access to a bunch of other tools and technologies, such as ALTREP in the R world, but also many new database engines etc. – daroczig Mar 03 '22 at 12:10

It depends on what you want to do and how you process the data further. In any case, loading from a binary R object is always going to be faster, provided you always need the same dataset. The limiting factor here is the speed of your hard drive, not R. The binary form is the internal representation of the data frame in the workspace, so no transformation is needed anymore.

Any kind of text file is a different story, as you invariably include an overhead: each time you read in the text file, the data has to be transformed into the binary R object. I'd forget about them. They are only useful for porting datasets from one application to another.

Setting up a MySQL backend is very useful if you need different parts of the data, or different subsets in different combinations. Especially when working with huge datasets, the fact that you don't have to load in the whole dataset before you can start selecting the rows/columns, can gain you quite some time. But this only works with huge datasets, as reading a binary file is quite a bit faster than searching a database.

If the data is not too big, you can save different data frames in one RData file, giving you the opportunity to streamline things a bit more. I often have a set of data frames in a list or in a separate environment (see also ?environment for some simple examples). This allows for lapply / eapply solutions to process multiple data frames at once, as in the sketch below.
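
A minimal sketch of that idea (object and file names are made up):

## keep several data frames together in their own environment,
## loaded from a single RData file that holds df1, df2, ...
datasets <- new.env()
load("all_datasets.RData", envir = datasets)

## process all of them at once without cluttering the global workspace
n.rows <- eapply(datasets, nrow)

## or do the same with a plain list of data frames
df.list <- as.list(datasets)
summaries <- lapply(df.list, summary)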

– Joris Meys
  • +1 for "different dataframes in one RData file". And to add: `my_data <- new.env(); load("my_data.RData", my_data)` is a safe way to load objects into R (without wiping out existing objects). – Marek Jan 21 '11 at 10:41
  • thank you for your answer! I will accept this one if no one comes up with a more creative answer showing in detail how to load only 2 columns of data from a binary object :) I made a little experiment (see my separate answer if you are interested) based on your suggestion, and it also shows that reading from binary files wins by a lot in this "competition". MySQL partial load also proved its importance and speed. – daroczig Jan 21 '11 at 12:41
  • No extra creative answer arrived, so the point goes to you, thanks again! :) I am still wondering if writing a special save function (e.g. saving every column to a separate binary object) would speed up loading objects when only a few columns are needed. I will post my experiments' results. – daroczig Jan 24 '11 at 09:27

I am pretty happy with RMySQL. I am not sure whether I understood your question the right way, but labels should not be a problem. There are several convenience functions that just use the default SQL table and row names, but of course you can also use custom SQL statements.

I would say (apart from large datasets that justify the hassle) one of the main reasons to use RMySQL is being more familiar with SQL syntax than with R's data juggling functions. Personally, I prefer GROUP BY over aggregate (see the sketch below). Note that using stored procedures from inside R does not work particularly well.
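
To illustrate that preference, a small sketch with purely made-up table and column names, assuming `con` is an open RMySQL connection:

## aggregation done by MySQL via GROUP BY ...
means.sql <- dbGetQuery(con,
    "SELECT grp, AVG(value) AS mean_value FROM mytable GROUP BY grp")

## ... versus pulling the whole table into R and using aggregate()
mytable <- dbReadTable(con, "mytable")
means.r <- aggregate(value ~ grp, data = mytable, FUN = mean)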

Bottom line... setting up a MySQL localhost is not too much effort – give it a try! I cannot tell exactly about the speed, but I have the feeling there's a chance it's faster. However, I will try it and get back here.

EDIT: here's the test... and the winner is: spacedman

# SQL connection (assumed to set up the connection object `con`)
source("lib/connect.R")

# time a full SELECT over the table
dbQuery <- "SELECT * FROM mytable"
mydata <- dbGetQuery(con, dbQuery)
system.time(dbGetQuery(con, dbQuery))
# returns
#   user  system elapsed 
#  0.999   0.213   1.715 

# time loading the same data from a binary R image
save.image(file = "speedtest.Rdata")
system.time(load("speedtest.Rdata"))
#   user  system elapsed 
#  0.348   0.006   0.358 

The file size was only about 1 MB here. Test environment: MacBook Pro, 4 GB RAM, 2.4 GHz Intel Core Duo, Mac OS X 10.6.4, MySQL 5.0.41. I had just never tried that, because I usually work with bigger datasets and loading is not the issue, rather processing... if there are time issues at all. +1 for the Q!

– Matt Bannert
  • I can't imagine how an RDBMS is going to be any faster than simply picking up a .RData file from a local hard disk (doubly so if it's an SSD). Can't wait to see some real tests. There must be all sorts of overheads translating from MySQL's wire format into R. – Spacedman Jan 21 '11 at 09:56
  • Thank you too for the answer and also for the test! I also did some experiments, see my answer for them. It really seems that binary data wins :) – daroczig Jan 21 '11 at 12:17

If it's at all possible, have the data transformed into csv or some other "simple" format to make reading as fast as possible (see Joris' answer). I import csv files en masse with lapply, something along the lines of:

## read every file in the directory into a list of data frames
list.of.files <- list.files("your dir", full.names = TRUE)
data.list <- lapply(list.of.files, FUN = function(x) {
    read.table(x, header = TRUE) # or some other reading function, like read.spss
})
– Roman Luštrik
  • Thank you for your answer and the tip on loading multiple files at once! It seems after a little test (see my added answer) that loading binary objects is a lot faster than anything else. – daroczig Jan 21 '11 at 12:16
  • Use `dir` instead of `list.files`. Save your keystrokes. One day, when you grow older and suffer from joint arthrosis, you'll recall my words... =) – aL3xa Jan 21 '11 at 14:45