I have read several threads about memory issues in R and I can't seem to find a solution to my problem.

I am running a sort of LASSO regression on several subsets of a big dataset. For some subsets it works well, but for some of the bigger subsets it does not, failing with errors of the type "cannot allocate vector of size 1.6Gb". The error occurs at this line of the code:

example <- cv.glmnet(x=bigmatrix, y=price, nfolds=3)

It also depends on the number of variables that were included in "bigmatrix".

I tried R and R64 on Mac, and R on a PC, but recently moved to a faster virtual machine running Linux, thinking I would avoid any memory issues. It was better, but I still hit some limits, even though memory.limit indicates "Inf".

Is there any way to make this work, or do I have to cut a few variables from the matrix or take a smaller subset of the data?

I have read that R looks for contiguous blocks of memory and that maybe I should pre-allocate the matrix? Any ideas?

Emmanuel

3 Answers

Let me build slightly on what @richardh said. All of the data you load with R chews up RAM. So you load your main data and it uses some hunk of RAM. Then you subset the data so the subset is using a smaller hunk. Then the regression algo needs a hunk that is greater than your subset because it does some manipulations and gyrations. Sometimes I am able to better use RAM by doing the following:

  1. save the initial dataset to disk using save()
  2. take a subset of the data
  3. rm() the initial dataset so it is no longer in memory
  4. do analysis on the subset
  5. save results from the analysis
  6. totally dump all items in memory: rm(list=ls())
  7. load the initial dataset from step 1 back into RAM using load()
  8. loop steps 2-7 as needed

Be careful with step 6 and try not to shoot your eye out. That dumps EVERYTHING in R memory. If it's not been saved, it'll be gone. A more subtle approach would be to delete the big objects that you are sure you don't need and not do the rm(list=ls()).
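In code, that loop might look roughly like the sketch below. The data set name, the grouping column, and the file names are all made up, so adapt them to your own objects:

library(glmnet)

save(bigdata, file = "bigdata.RData")        # step 1: keep the full data on disk

results <- list()
for (grp in subset_ids) {                    # "subset_ids" is a hypothetical list of subsets
    sub <- bigdata[bigdata$group == grp, ]   # step 2: take a subset
    rm(bigdata); gc()                        # step 3: drop the full data and reclaim RAM

    xmat <- as.matrix(sub[, setdiff(names(sub), c("price", "group"))])  # predictors only
    fit  <- cv.glmnet(x = xmat, y = sub$price, nfolds = 3)              # step 4
    results[[grp]] <- fit$lambda.min         # step 5: keep only what you need
    save(results, file = "results.RData")

    rm(list = setdiff(ls(), c("results", "subset_ids", "grp")))  # step 6, the gentler version
    load("bigdata.RData")                    # step 7: reload the full data
}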

If you still need more RAM, you might want to run your analysis in Amazon's cloud. Their High-Memory Quadruple Extra Large Instance has over 68GB of RAM. Sometimes when I run into memory constraints I find the easiest thing to do is just go to the cloud where I can be as sloppy with RAM as I want to be.

Jeremy Anglim has a good blog post that includes a few tips on memory management in R. In that blog post Jeremy links to this previous StackOverflow question which I found helpful.

JD Long
  • @JD -- Better said! I'm (slowly) learning the ways. And good find on the prev SO question. – Richard Herron Jan 16 '11 at 21:29
  • Your answer was good. I just threw in some more on top as I was suspicious the OP might not grasp exactly what you were getting at. – JD Long Jan 16 '11 at 21:34

I don't think this has to do with contiguous memory, but just that by default R works only in RAM (i.e., it can't write objects out to disk). Farnsworth's guide to econometrics in R mentions the package filehash to enable writing to disk, but I don't have any experience with it.
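For reference, the basic filehash pattern looks something like this; I haven't tested it, and the database and object names are just placeholders:

library(filehash)

dbCreate("bigmatrix_db")               # create a disk-backed database (a file on the HDD)
db <- dbInit("bigmatrix_db")

dbInsert(db, "bigmatrix", bigmatrix)   # copy the big object into the database
rm(bigmatrix); gc()                    # then drop the in-memory copy

bigmatrix <- dbFetch(db, "bigmatrix")  # read it back from disk only when needed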

Your best bet may be to work with smaller subsets, manage memory manually by removing variables you don't need with rm() (i.e., run the regression, store the results, remove the old matrix, load the new matrix, repeat), and/or get more RAM. HTH.

Richard Herron
    Lack of RAM typically wouldn't lead to out of memory errors. The system would use paging to deal with a lack of physical RAM. You might see thrashing, but why would you see an allocation failure? – David Heffernan Jan 16 '11 at 20:50
  • @davide do you have experience with R or are you speaking out of experience with other software? R will undoubtedly fail if it runs out of RAM. – JD Long Jan 16 '11 at 20:55
  • @JD What about paging? Does R explicitly disable paging to disk?! – David Heffernan Jan 16 '11 at 21:07
  • My R installation frequently exceeds the 24GB of RAM and resorts to paging, so the statement that R does not allow paging is false on at least one OS. – IRTFM Jan 16 '11 at 21:10
  • @JD Are you perhaps talking about running out of address space when running a 32 bit version of R? Sure, paging won't help in that situation, but that is not a lack of memory, that's a shortage of address space, which is a subtle difference. – David Heffernan Jan 16 '11 at 21:14
  • @David and @DWin, I very well may not grasp the subtle issues of memory management. I'm no computer scientist, that's for sure. But my experience is that if I create a data object that exceeds memory, R will fall over. Here's an example that predictably fails on my laptop (4GB of RAM, Ubuntu 64 bit). https://gist.github.com/782164 Give that a go on your boxes and let me know if your experience is different. – JD Long Jan 16 '11 at 21:39
  • @JD I don't have access to a 64 bit version of R so I can't run it. So I don't have first hand experience to go on. But I do know about virtual memory and paging, and that should mean that you get thrashing rather than failure to allocate. – David Heffernan Jan 16 '11 at 21:50
  • @JD do you have any swap space on your laptop? It sounds like R is getting killed by the OOM killer - if you add swap space your R session will just start paging data to disk (and therefore grind and go very slowly) but should eventually finish – Aaron Statham Jan 17 '11 at 02:44
  • @all - I have done some digging and I'm still confused. The guides I read (Burn's _R Inferno_, the R admin/install guide, and `help('Memory-limits')`) say that 64 bit R works in "virtual memory", which implies that R should be able to write to a page file. But I can't; I still bump into the `cannot allocate` error for anything greater than 8 gb. I am on a Windows 7 64 bit laptop with 8 gb RAM (I think this by default has an 8 gb page file). What am I missing? Can I tell R to use a page file on Win/Mac/Linux? Is this worth a new question? I also use a Mac with 4 gb, so this could come in handy. Thanks! – Richard Herron Jan 18 '11 at 15:02

Try the bigmemory package. It is very easy to use. The idea is that the data are stored in a file on the HDD and you create an object in R as a reference to this file. I have tested this one and it works very well.
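A small sketch of what that looks like (the file names and column layout here are assumptions, not from the original question). Note that, as far as I know, cv.glmnet still wants an ordinary matrix, so you would pull the columns you need back into RAM with x[, cols] before fitting:

library(bigmemory)

# parse the CSV once into a file-backed matrix stored on disk
x <- read.big.matrix("bigdata.csv", header = TRUE, type = "double",
                     backingfile = "bigdata.bin",
                     descriptorfile = "bigdata.desc")

# in a later session, re-attach the same matrix without re-reading the CSV
x <- attach.big.matrix("bigdata.desc")

dim(x)         # behaves much like an ordinary matrix
x[1:5, 1:3]    # rows/columns are pulled from disk on demand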

There are some alternatives as well, like "ff". See the CRAN Task View: High-Performance and Parallel Computing with R for more information.

djhurio