
I am currently struggling with a somewhat large dataset (30M rows, 14+ columns) in R, using the data.table package, on a laptop with 8 GB RAM running 64-bit Windows 10.

I have been hitting memory limits all day long, getting the error that R cannot allocate a vector of slightly over 200 MB. When I look at the Windows Task Manager, I can see that R is currently using 2-3 GB of RAM (about 65% of the total is in use, including the system and some other processes). When I run gc(), however, the output says that about 7800 Mb out of 8012 Mb is currently in use.

When I run gc() a second time, I can see that the first run did not change the amount of memory in use at all.
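For reference, this is how I read the gc() output (a minimal sketch of the standard gc() matrix; my actual figures are the ones quoted above):

```r
## A second gc() right after the first: the "used" figures barely move,
## because the first collection already freed whatever was unreferenced.
m <- gc()
m            # matrix with rows Ncells / Vcells

## Column 2 holds the "used" figure in Mb; summing both rows gives the total
## memory that R's own heap accounts for -- the number I am comparing
## against Task Manager.
sum(m[, 2])
```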

When processing the data (i.e. executing some data.table command), the process uses pretty much all of the installed memory and writes a thing or two to disk.

What is the reason for the difference between the gc() output and what I see in Task Manager? Or, to be more precise, why is the number in Task Manager lower?

ira

1 Answer


Assuming that Windows' native Task Manager only shows physical RAM statistics, it is quite likely (and in my experience it really is the case) that the rest of the memory used by R (your "missing" 5-6 GB) has been pushed out to the swap file by Windows (which then makes everything really slow). You can check this yourself, e.g. with Process Explorer, which I use and which also shows virtual memory (including the part that lives on disk).
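If you prefer to cross-check from inside R rather than with Process Explorer, the Windows-only helpers below report the R process's own view of its memory (note they have since been removed in R >= 4.2):

```r
## Windows-only helpers reporting R's own memory accounting
memory.size()            # Mb currently in use by this R session
memory.size(max = TRUE)  # maximum Mb obtained from the OS so far
memory.limit()           # allocation limit in Mb (likely the 8012 Mb you quoted)
```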

The allocation fails before physical RAM is completely exhausted, presumably to protect the system from crashing. In my experience Windows does not simply keep swapping R's memory out; at some point its limit is reached and you get something like Error: cannot allocate vector of size 200 Mb -- see also this question.

I guess you would like to free memory with gc(), although whether calling it explicitly actually helps is a matter of some debate.
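If you do call it, it only makes a difference once the references to the large objects are gone, e.g. (object names here are placeholders):

```r
## gc() can only reclaim memory that is no longer referenced,
## so drop the big intermediate objects first (names are placeholders)
rm(intermediate_DT, tmp_result)
gc()   # the freed space can now be reused or handed back to the OS
```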

If you have no other machine with more RAM and don't want to upgrade your laptop's RAM, you could look into cloud computing.

I hope this might help you.

jay.sf
  • Thank you. Do you know if there is a way I could persuade `data.table` to use the disk a bit more during the calculation so that it doesn't need that much RAM? – ira Nov 20 '17 at 08:38
  • 1
    Nevermind, I just tried increasing the `memory.limit` to double of what I have actually installed and I suppose it works alright and uses the disk as it should, without making the system go nuts. I was afraid to do that, because I had no idea if R will actually use up to 100% of the physical RAM and then the machine would be basically unusable... – ira Nov 20 '17 at 08:56
  • 1
    You could also 1. try to [increase](http://www.thewindowsclub.com/increase-page-file-size-virtual-memory-windows) windows swap file 2. trivially trim your dataset to only the columns you need 3. save it to presumably smaller R-format with `save(mydata, file= "mydata.RData")` restart R and just load it by `load("mydata.Rdata")`. _Incidentially:_ Since R could crash without autosave you may want to save your code before running any probably RAM-intensive command. Good luck! – jay.sf Nov 20 '17 at 10:28
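A minimal sketch of suggestions 2 and 3 above, plus the `memory.limit()` idea from the earlier comment; the file name and column names are placeholders, and `memory.limit()` is Windows-only (removed in R >= 4.2):

```r
library(data.table)

## 2. keep only the columns you actually need -- fread()'s `select` argument
##    avoids reading the other columns into RAM at all
##    (file name and column names are placeholders)
DT <- fread("mydata.csv", select = c("id", "date", "value"))

## 3. save the trimmed table in R's binary format, restart R, then reload it
save(DT, file = "mydata.RData")
## ... restart R ...
load("mydata.RData")

## Raising the allocation limit (as in the earlier comment) lets R spill into
## the page file instead of failing outright -- Windows-only, removed in R >= 4.2
memory.limit(size = 16000)
```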