
I am developing an R package called biglasso that fits lasso models to massive data sets in R by using the memory-mapping techniques implemented in the bigmemory C++ library. Specifically, for a very large dataset (say 10GB), a file-backed big.matrix is first created, with memory-mapped files stored on disk. Then the model-fitting algorithm accesses the big.matrix via the MatrixAccessor defined in the C++ library to obtain data for computation. I assume that the memory-mapping technique allows working on data larger than the available RAM, as mentioned in the bigmemory paper.
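For reference, here is a minimal sketch of that setup with bigmemory (the file names and dimensions are made up for illustration; in the real use case the backing file can be far larger than RAM):

```r
# Minimal sketch of the setup described above; file names and sizes are
# illustrative only.
library(bigmemory)

# Create a file-backed big.matrix: "X.bin" holds the data on disk and
# "X.desc" is a small descriptor file used to re-attach it later.
X <- filebacked.big.matrix(nrow = 1e4, ncol = 200, type = "double",
                           backingfile    = "X.bin",
                           descriptorfile = "X.desc",
                           backingpath    = ".")

# In a later session (or from worker processes), re-attach the same data
# without copying it into RAM; only the pages actually touched get paged in.
X <- attach.big.matrix("X.desc")
```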

For my package, everything works great at this point as long as the data size doesn't exceed the available RAM. However, when the data is larger than RAM, the code runs seemingly forever: no complaints, no errors, no stop. On Mac, I checked the top command and noticed that the status of this job kept switching between "sleeping" and "running"; I am not sure what this means or whether it indicates something is going on.

[EDIT:]

By "cannot finish", "run forever", I mean that: working on 18 GB of data with 16 GB RAM cannot finish for over 1.5 hours, but it could be done within 5 minutes if with 32 GB RAM.

[END EDIT]

Questions:

(1) I basically understand that memory-mapping utilizes virtual memory so that it can handle data larger than RAM. But how much memory does it need to deal with larger-than-RAM objects? Is there an upper bound? Or is it decided by the size of the virtual memory? Since virtual memory is effectively unlimited (constrained only by the hard drive), would that mean the memory-mapping approach can handle data much, much larger than physical RAM?

(2) Is there a way I can measure the memory used in physical RAM and the virtual memory used, separately?

(3) Is there anything I am doing wrong? What are the possible reasons for my problems here?

Really appreciate any feedback! Thanks in advance.


Below are some details of my experiments on Mac and Windows and related questions.

  1. On Mac OS: Physical RAM: 16GB; Testing data: 18GB. Here is a screenshot of the memory usage. The code cannot finish.

[Screenshot: Activity Monitor memory usage]

[EDIT 2]

[Screenshot: CPU usage and CPU history]

I have attached the CPU usage and history here. Only a single core is used for the R computation. It's strange that System uses 6% CPU while User uses just 3%. And in the CPU-history window, there is a lot of red area.

Question: What does this suggest? I now suspect that the CPU cache is filled up. Is that right? If so, how could I resolve this issue?

[END EDIT 2]

Questions:

(4) As I understand it, the "memory" column shows the memory used in physical RAM, while the "real memory" column shows the total memory usage, as pointed out here. Is that correct? The memory used always shows ~2GB, so I don't understand why so much of the RAM is not used.

(5) A minor question. From what I observed, it seems that "memory used" + "Cache" must always be less than "Physical memory" (in the bottom middle part). Is this correct?


  2. On a Windows machine: Physical RAM: 8GB; Testing data: 9GB. What I observed was that as my job started, the memory usage kept increasing until it hit the limit. The job could not finish either. I also tested functions in the biganalytics package (which also uses bigmemory), and found that the memory blows up too.

[Screenshot: Windows Task Manager memory usage]

– SixSigma
  • Memory maps work by saving your objects to disk when you're not using them, and reloading them when you need them, effectively using the hard disk itself as if it were RAM. Do you know how much slower a hard disk is than actual RAM? [up to 100,000 times slower](http://stackoverflow.com/questions/1371400/). While your program waits for this memory to be saved or loaded, it sleeps to avoid waste. – Mooing Duck Mar 11 '16 at 23:14
  • As I mentioned below in response to @antlersoft: if I run the 18 GB data with 32 GB RAM, the job finishes within 5 minutes. But if I run it with 16 GB, in one experiment it didn't finish within 1.5 hours. Why would that be so slow? I thought the whole point of memory mapping is to work as if in primary memory. https://en.wikipedia.org/wiki/Memory-mapped_file. Did I miss anything about the memory-mapping technique? – SixSigma Mar 12 '16 at 00:07
  • Yes, the upside of memory mapping is that it works as if it were primary memory. The downside is that it's incredibly slow. As for why your 18GB went fast and the 16GB went slow, I'd speculate it's because something else was using more memory in the background. – Mooing Duck Mar 12 '16 at 00:33
  • Sorry for the confusion. I meant that working on the 18GB data with 32GB RAM, it finished within 5 minutes; but run with 16GB RAM, still on the 18GB data (so data larger than RAM), it takes forever. – SixSigma Mar 12 '16 at 00:38
  • if "the upside of memory mapping is that works as if it were primary memory", then why does "it's incredibly slow"? If it's so slow, what's the difference between memory mapping and directly reading from disk? And then what's the point of using memory mapping? – SixSigma Mar 12 '16 at 00:39
  • Mainly, memory maps are directly addressable. Other advantages are that they can be paged in and out by the operating system as needed, and can be shared between processes. You can do most of that yourself by reading and writing to disk, but a memory map abstracts all that away, including doing the blocking IO in a separate process. – Mooing Duck Mar 12 '16 at 00:49

2 Answers


"Cannot finish" is ambiguous here. It may be that your computation will complete, if you wait long enough. When you work with virtual memory, you page it on and off disk, which is thousands to millions times slower than keeping it in RAM. The slowdown you will see depends on how your algorithm accesses memory. If your algorithm only visits each page one time in a fixed order, it might not take too long. If your algorithm is hopping all around your data structure O(n^2) times, the paging is going to slow you down so much it might not be practical to complete.

– antlersoft
  • Thanks for the comment. So if I run the 18 GB data with 32 GB RAM, the job finishes within 5 minutes. But if I run it with 16 GB, in one experiment it didn't finish within 1.5 hours. And yes, the algorithm needs to access the whole data set many times, say 1000 times. However, I am confused by your explanation. Shouldn't the whole point of memory-mapping be to work as if in primary memory? https://en.wikipedia.org/wiki/Memory-mapped_file – SixSigma Mar 12 '16 at 00:03
  • @AaronZeng memory mapping means you can access your data programmatically _as if_ it's all in memory. That doesn't change the fact that, under the hood, it's being spooled off the disk. This is going to slow you down no matter what. – Hong Ooi Mar 14 '16 at 21:52

On Windows, it might be useful to check Task Manager -> Performance -> Resource Monitor -> Disk activity to see how much data is being written to disk by your process ID. It can give an idea of how much data is being written to virtual memory from RAM, whether write speed is becoming a bottleneck, etc.
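On the Mac (or Linux) side, and relating to question (2) above, one rough way to watch resident vs. virtual memory of the R process from inside R is to query the system ps utility; this is only a sketch, and the reported columns and units can differ between systems:

```r
# Sketch for macOS/Linux: report the R process's resident (physical RAM) and
# virtual memory sizes via the system "ps" utility. Column meanings and units
# (typically KB) can vary slightly between systems.
pid <- Sys.getpid()
out <- system2("ps", args = c("-o", "rss=,vsz=", "-p", as.character(pid)),
               stdout = TRUE)
vals <- as.numeric(strsplit(trimws(out), "\\s+")[[1]])
cat(sprintf("resident (RAM): %.1f MB, virtual: %.1f MB\n",
            vals[1] / 1024, vals[2] / 1024))
```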

  • Thanks, I will check that. But in my experiment, I don't need to write data; I just need to read the large data from disk for computation. So I might not understand your point about "seeing how much data is being written to disk". – SixSigma Mar 12 '16 at 00:09
  • @AaronZeng: If you're using memory maps, Windows will write those to disk _constantly_ behind the scenes. It may also reload them from disk as needed. – Mooing Duck Mar 12 '16 at 00:34
  • Oh, got you! So there is still a lot of reading/writing between RAM and disk, and that would make the code much, much slower. Is that right? – SixSigma Mar 12 '16 at 00:41