
I have read about various big data packages for R. Many seem workable, except that, as I understand it, the packages I rely on for common models would not be available alongside the recommended big data packages (for instance, I use lme4, VGAM, and other fairly common regression packages that don't seem to play well with big data packages like ff).

I recently attempted to use VGAM to fit polytomous models using data from the General Social Survey. When I ran models that accounted for the clustering of respondents within years, along with a list of other controls, I started hitting the "cannot allocate vector of size..." errors. I've tried the usual recommendations, such as clearing out memory and using matrices where possible, to no effect. I am inclined to increase the RAM on my machine (actually, just buy a new machine with more RAM), but I want a good idea of whether that will solve my woes before letting go of $1500 on a new machine, particularly since this is for my personal use and will be funded solely by me on my grad student budget.

Currently I am running a Windows 8 machine with 16GB RAM, R 3.0.2, and all packages I use updated to their most recent versions. The data sets I typically work with max out at under 100,000 individual cases/respondents. As far as analyses go, I may need matrices and/or data frames with many rows: for example, if I use 15 variables with interactions between factors that have several levels, or if I need multiple rows per case after reshaping to one row per category of some DV for each of my 100,000 respondents. That may be a touch large for some social science work, but in the grand scheme of things my requirements are not all that hefty as far as data analysis goes. I'm sure many R users do far more intense analyses on much bigger data.
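As a rough illustration of the sizes involved (the row and column counts below are made up for the sake of the example, not figures from my actual data), a numeric model matrix costs about 8 bytes per cell:

```r
# Hypothetical numbers for illustration only
n_rows <- 100000 * 5    # one row per DV category per respondent (assuming 5 categories)
n_cols <- 300           # dummy variables plus interaction terms (assumed)
n_rows * n_cols * 8 / 1024^3   # ~1.1 GB for a single copy of the model matrix
```

A fitting function will typically hold several working copies of something that size at once, so peak memory use is a multiple of that figure.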

So, I guess my question is this: given the data size and types of analyses I'm typically working with, what would be a comfortable amount of RAM to avoid memory errors and/or having to use special packages to handle the size of the data/processes I'm running? For instance, I'm eyeballing a machine that sports 32GB RAM. Will that cut it? Should I go for 64GB RAM? Or do I really need to bite the bullet and start learning to use R with big data packages, or maybe just find a different stats package or learn a more intensive programming language (not even sure what that would be: Python? C++?). The latter option would be nice in the long run, of course, but would be rather prohibitive for me at the moment. I'm mid-stream on a couple of projects where I am hitting similar issues and don't have time to build new language skills altogether while under deadlines.

To be as specific as possible: what is the maximum data-handling capability of 64-bit R on a good machine with 16GB, 32GB, and 64GB RAM? I searched around and didn't find clear answers that I could use to gauge my personal needs at this time.

user2800929
    You could test your actual memory demand with an Amazon EC2 instance. – Roland Jan 24 '14 at 14:47
  • Just one quick comment: at least in principle, `lme4` shouldn't stress too much on problems with 100,000 observations. The `InstEval` example from the `MEMSS` package has 73K observations, and `lme4` runs fine on a non-particularly-large laptop: https://github.com/lme4/lme4/issues/150 (on the other hand, your fixed effect models are somewhat more complex ...) – Ben Bolker Jan 24 '14 at 14:47
  • ... and if you need to use `lme` with the `corARMA` structure, you can fill your RAM real fast. :( – Roland Jan 24 '14 at 14:54
  • How much RAM you need really depends on your modelling. model.matrix can consume quite a lot of RAM depending on how many factor levels you have. If you do clustering, it depends on how large the distance matrix is. In general, you should understand the computational technique behind the statistical model to understand your RAM requirements. –  Jan 24 '14 at 15:43
  • Thanks for the assists. I've been looking into Amazon EC2 and it seems promising. I hadn't thought to hop on there and run things to check memory needs, but that would be a great way to gauge things before building a new machine. I've also noticed on some models that I can run the model fine, but when I start running additional code that takes results from the model (say, the coefs) and builds custom formatted tables for output to .csv (to share with colleagues who don't use R), I hit memory issues there and need to do some cleanup before the manipulations. – user2800929 Jan 27 '14 at 17:49 (see the workspace-cleanup sketch after these comments)
  • Another interesting problem I had was this: I wrote a script to run a couple dozen models and at the end saved an environment to file. Later, I booted up and tried to load that environment, and R was unable to open it, giving the "vector of size..." memory errors. That's manageable, obviously, by saving smaller environments, but the few models where fitting itself throws the memory errors are a huge setback... – user2800929 Jan 27 '14 at 17:51
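A minimal sketch of the kind of workspace cleanup mentioned in the comments above (the object name is a placeholder, not something from the thread):

```r
# List the largest objects currently in the workspace, in bytes
obj_sizes <- sapply(ls(), function(x) object.size(get(x)))
head(sort(obj_sizes, decreasing = TRUE), 10)

rm(big_intermediate_table)  # placeholder name for whatever is no longer needed
gc()                        # prompt R to release the freed memory
```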

1 Answer


A general rule of thumb is that R needs roughly three times the size of the dataset in RAM to work comfortably; this is because R copies objects during many operations. So, divide your RAM size by three to get a rough estimate of your maximum dataset size. Then you can look at the type of data you use, and choose how much RAM you need.
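As a quick illustration of that arithmetic (the numbers are made up, loosely matching the sizes mentioned in the question):

```r
ram_gb <- 32               # candidate machine
ram_gb / 3                 # rough comfortable dataset size: ~10.7 GB

# For comparison, 100,000 rows of 15 numeric columns at 8 bytes per value:
100000 * 15 * 8 / 1024^2   # ~11 MB; the raw data is small. The expanded
                           # model matrices and per-model copies are what add up.
```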

Of course, R can also process data out-of-memory, see the HPC task view. This earlier answer of mine might also be of interest.

Paul Hiemstra
  • Although, the memory needed for the matrices commonly created by some regression functions may increase quadratically or (much) worse with data size and the number of predictors. – Roland Jan 24 '14 at 14:44