
Let's say I have a 4 GB dataset on a server with 32 GB of RAM.

I can read all of that into R, build a data.table global variable, and have all of my functions use that global as a kind of in-memory database. However, when I exit R and restart, I have to read it all from disk again. Even with smart disk caching strategies (save/load or R.cache) I see a delay of about 10 seconds getting that data back in. Copying the data within R takes about 4 seconds.

Is there a good way to cache this in memory that survives the exit of an R session?

A few things come to mind: Rserve, redis/rredis, memcached, multicore ... Shiny Server and RStudio Server also seem to have ways of addressing this problem.

But then again, it seems to me that perhaps data.table could provide this functionality since it appears to move data outside of R's memory block anyway. That would be ideal in that it wouldn't require any data copying, restructuring etc.

Update:

I ran some more detailed tests and I agree with the comment below that I probably don't have much to complain about.

But here are some numbers that others might find useful. I have a 32 GB server. I created a data.table 4 GB in size. According to gc(), and also looking at top, it appeared to use about 15 GB of memory at peak, and that includes making one copy of the data. That's pretty good, I think.

I wrote it to disk with save(), deleted the object, and used load() to recreate it. This took 17 and 10 seconds respectively.

I did the same with the R.cache package, and that was actually slower: 23 and 14 seconds.

However, both of those reload times are quite fast. The load() method gave me a transfer rate of about 357 MB/s. By comparison, an in-memory copy took 4.6 seconds. This is a virtual server; I'm not sure what kind of storage it has, or how much that read speed is influenced by the OS cache.
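For anyone wanting to reproduce this, the test was roughly along these lines (sizes, paths and column names here are illustrative, not the exact code I ran):

    library(data.table)

    ## roughly 4 GB: 1.25e8 rows x 4 numeric columns x 8 bytes
    n  <- 1.25e8
    DT <- data.table(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
    print(object.size(DT), units = "Gb")

    ## save()/load() round trip
    system.time(save(DT, file = "/tmp/DT.RData"))   # ~17 s in my case
    rm(DT); gc()
    system.time(load("/tmp/DT.RData"))              # ~10 s in my case

    ## in-memory copy, for comparison
    system.time(DT2 <- copy(DT))                    # ~4.6 s in my case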

Dave31415

3 Answers


Very true: data.table hasn't got to on-disk tables yet. In the meantime, some options are:

  • Don't exit R. Leave it running on a server and use svSocket's evalServer() to talk to it, as the video on the data.table homepage demonstrates (see the first sketch after this list). Or use the other, similar options you mentioned.

  • Use a database for persistence, whether SQL or any of the NoSQL databases (see the second sketch after this list).

  • If you have large delimited files, some people have recently reported that fread() appears (much) faster than load(); but do experiment with save(..., compress=FALSE) as well. Also, we've just pushed fwrite() to the current development version (1.9.7; use devtools::install_github("Rdatatable/data.table") to install), which has some reported write times on par with native save() (see the third sketch after this list).

  • Packages ff, bigmemory and sqldf, too. See the HPC Task View, the "Large memory and out-of-memory data" section.
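For the first bullet, a minimal sketch of the svSocket approach; the file big.csv and the columns val and grp are purely illustrative:

    ## In the long-running R session that holds the data:
    library(svSocket)
    library(data.table)
    DT <- fread("big.csv")             # or load()/readRDS() it once
    startSocketServer(port = 8888)

    ## In any later R session that needs the data:
    library(svSocket)
    con <- socketConnection(host = "localhost", port = 8888, blocking = FALSE)
    evalServer(con, nrow(DT))                         # evaluated in the server's workspace
    res <- evalServer(con, DT[, sum(val), by = grp])  # result shipped back to the client
    close(con)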
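For the second bullet, a sketch using RSQLite purely as an example backend; any DBI-compatible database would look much the same (table and file names are made up):

    library(DBI)
    library(RSQLite)
    library(data.table)

    con <- dbConnect(SQLite(), "/data/cache.sqlite")

    ## write once, in whichever session built the table
    dbWriteTable(con, "big_table", DT, overwrite = TRUE)

    ## read it back into a data.table in a later session
    DT <- as.data.table(dbReadTable(con, "big_table"))
    dbDisconnect(con)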
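For the third bullet, the comparison would look something like this (DT and the file names are placeholders; fwrite() needs the 1.9.7 development version mentioned above):

    library(data.table)

    ## binary save/load, with and without compression
    system.time(save(DT, file = "DT.RData"))                       # compress = TRUE by default
    system.time(save(DT, file = "DT_raw.RData", compress = FALSE))
    system.time(load("DT_raw.RData"))

    ## delimited text via fwrite()/fread()
    system.time(fwrite(DT, "DT.csv"))      # data.table 1.9.7+ only
    system.time(DT2 <- fread("DT.csv"))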

In enterprises where data.table is being used, my guess is that it is mostly being fed with data from some other persistent database, currently. Those enterprises probably:

  • use 64-bit R with, say, 16 GB, 64 GB or 128 GB of RAM. RAM is cheap these days. (But I realise this doesn't address persistence.)

The internals have been written with on-disk tables in mind. But don't hold your breath!

Matt Dowle

If you really do need to exit R between computation sessions for some reason, and the server is not restarted, then just make a ramdisk of 4 GB or so and store the data there. Loading the data from RAM to RAM is much faster than from any SAS or SSD drive :)

This can be done pretty easily on Linux by adding a line like this to /etc/fstab:

none  /data  tmpfs  nodev,nosuid,noatime,size=5000M,mode=1777  0  0
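Once that tmpfs is mounted, R just sees /data as an ordinary directory, so for example (object and file names are illustrative):

    ## write the object to the ramdisk before quitting R
    save(DT, file = "/data/DT.RData", compress = FALSE)

    ## next session: reload from RAM-backed storage
    load("/data/DT.RData")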
daroczig

Depending on what your dataset looks like, you might consider using the ff package. If you save your dataset as an ffdf, it will be stored on disk, but you can still access the data from R. ff objects have a virtual part and a physical part: the physical part is the data on disk, while the virtual part holds the metadata describing it.

To load this dataset in R, you only load the virtual part, which is a lot smaller (maybe only a few KB, depending on whether you have a lot of factor data). So your data would be loaded into R in a matter of milliseconds instead of seconds, while you still have access to the physical data on disk for your processing.
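A minimal sketch of that workflow, assuming the data comes from a CSV file and using ffsave()/ffload() to persist the ff files between sessions (file names are illustrative):

    library(ff)

    ## one-off: read the data straight into an on-disk ffdf
    big <- read.csv.ffdf(file = "big.csv", header = TRUE)

    ## persist the ff files together with the small virtual part
    ffsave(big, file = "/data/bigcache")   # creates /data/bigcache.ffData + .RData

    ## in a later R session: only the virtual part is loaded into memory
    ffload("/data/bigcache")
    dim(big)        # metadata is available immediately
    big[1:5, ]      # rows are read from disk on demand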