
For a project I have to deal with a 30 GB dataset. I can use a very powerful supercomputer with enough RAM to hold the entire dataset, which is what I want, since some of the algorithms I have to implement need the whole dataset at once. The problem is that loading the dataset is still very slow.

I would like to ask for practical suggestions to speed up the loading. My idea was to divide it among C++11 threads, each loading a separate chunk of the data based on its index. I have also heard of the STXXL library, but it seems aimed at out-of-core computation, i.e. it avoids loading the data into RAM. That is exactly what I want to avoid: I have the RAM available, and I expect working in memory to be faster.
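A minimal sketch of the chunk-per-thread idea: each C++11 thread opens its own std::ifstream, seeks to its chunk, and reads into a pre-sized shared buffer, so no locking is needed. The function name and signature here are my own invention, not from any library:

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <string>
#include <thread>
#include <vector>

// Read the file in num_threads chunks, one std::ifstream per thread,
// so the threads never share a stream and need no synchronization.
std::vector<char> load_parallel(const std::string& path, unsigned num_threads) {
    std::ifstream probe(path, std::ios::binary | std::ios::ate);
    const std::size_t size = static_cast<std::size_t>(probe.tellg());
    std::vector<char> buffer(size);

    const std::size_t chunk = (size + num_threads - 1) / num_threads;
    std::vector<std::thread> workers;
    for (unsigned i = 0; i < num_threads; ++i) {
        workers.emplace_back([&buffer, &path, i, chunk, size] {
            const std::size_t begin = i * chunk;
            if (begin >= size) return;  // nothing left for this thread
            const std::size_t len = std::min(chunk, size - begin);
            std::ifstream in(path, std::ios::binary);
            in.seekg(static_cast<std::streamoff>(begin));
            in.read(buffer.data() + begin, static_cast<std::streamsize>(len));
        });
    }
    for (auto& t : workers) t.join();
    return buffer;
}
```

Whether this actually helps depends on the storage: a single spinning disk can get slower under concurrent readers, while a striped RAID or a parallel file system (common on supercomputers) usually benefits.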

Pippo
  • Are you using memory mapping? – seanmcl Oct 16 '13 at 14:12
  • @seanmcl What do you mean exactly? – Pippo Oct 16 '13 at 14:13
  • Something along the lines of mmap. (http://en.wikipedia.org/wiki/Memory-mapped_file). Please tell us what you've tried that is so slow. – seanmcl Oct 16 '13 at 14:16
  • Well, actually nothing so far. I just have a serial C++ program that loads the data and does something with it. – Pippo Oct 16 '13 at 14:18
  • If I understood correctly, memory-mapped files are a similar approach to the STXXL library: avoid loading the data into RAM and perform the calculations while leaving it on the hard disk. Is that right? – Pippo Oct 16 '13 at 14:21
  • Try using mmap, which maps a disk file to a chunk of virtual memory. That tends to be much faster: http://stackoverflow.com/questions/45972/mmap-vs-reading-blocks – seanmcl Oct 16 '13 at 14:21
  • Do you really need all of the data at once in memory? Can you change the algorithm to work on smaller subsets of data? – David Rodríguez - dribeas Oct 16 '13 at 14:25
  • No. It means that when you access a memory location in your program, if that location isn't in memory it grabs the page from the disk. It may not be that much better than what you're doing, depending on your platform. Just an idea. Trying multithreaded reads seems reasonable, but getting the details of such a program correct is obviously more challenging. – seanmcl Oct 16 '13 at 14:25
  • @dribeas Actually I could, but I would lose the asymptotic optimality of my algorithm. Anyway, I confess I am also trying this approach: I am using C++11 threads to load different portions of the dataset, process them on different cores, and store the partial results in text files for later processing. Do you have a better approach? – Pippo Oct 16 '13 at 14:30
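To make seanmcl's mmap suggestion concrete: it is not the same as STXXL's out-of-core model. The kernel pages the file into RAM (the page cache) as it is touched, so with enough RAM the data ends up in memory anyway, without an explicit read loop. A POSIX-only sketch (Linux/macOS; error handling trimmed, and the struct and function names are made up):

```cpp
#include <cstddef>
#include <fcntl.h>
#include <string>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Map a whole file read-only into the address space; the kernel pages it
// into RAM on demand (or eagerly, with the Linux-specific MAP_POPULATE).
struct MappedFile {
    const char* data = nullptr;
    std::size_t size = 0;
};

MappedFile map_file(const std::string& path) {
    MappedFile mf;
    int fd = open(path.c_str(), O_RDONLY);
    if (fd < 0) return mf;
    struct stat st;
    if (fstat(fd, &st) == 0 && st.st_size > 0) {
        void* p = mmap(nullptr, static_cast<std::size_t>(st.st_size),
                       PROT_READ, MAP_PRIVATE, fd, 0);
        if (p != MAP_FAILED) {
            mf.data = static_cast<const char*>(p);
            mf.size = static_cast<std::size_t>(st.st_size);
        }
    }
    close(fd);  // the mapping stays valid after closing the descriptor
    return mf;
}

void unmap_file(MappedFile& mf) {
    if (mf.data) munmap(const_cast<char*>(mf.data), mf.size);
    mf = MappedFile{};
}
```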

1 Answer


Profile. Find out which part of your program is taking the most time, then optimize that part. Everything else is micro-optimization.
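A coarse timer around each phase is often enough to find out whether I/O or computation dominates, before reaching for a full profiler. A minimal sketch using std::chrono (the phase name is just a label I invented):

```cpp
#include <chrono>
#include <cstdio>

// Time one phase with a steady (monotonic) clock and report milliseconds.
template <typename F>
double time_phase(const char* name, F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    const double ms =
        std::chrono::duration<double, std::milli>(stop - start).count();
    std::printf("%s: %.1f ms\n", name, ms);
    return ms;
}
```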

You may want to split your program into at least two threads, maybe three. Thread 1 reads the data and places it into an input buffer. Thread 2 performs the computation (including parsing) on the input buffer and places the results into an output buffer. Thread 3 takes data from the output buffer and displays or stores it.

You may need multiple input buffers, depending on how fast the data arrives. Two are sufficient; three or more give the computation thread extra slack. The idea is that the input thread fills one buffer while the computation thread processes another; when the computation finishes, it moves on to the next filled buffer, and likewise for the reading thread.
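A condition-variable sketch of that buffer hand-off, reduced to one reader and one computation thread (the class name is mine; a real version would likely bound the queue at two or three buffers to get exactly the fill-one-while-processing-another behavior):

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Single-producer / single-consumer hand-off: the reading thread pushes
// filled buffers, the computation thread pops them; close() plus an empty
// queue signals end of input.
class BufferQueue {
public:
    void push(std::vector<char> buf) {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(std::move(buf));
        cv_.notify_one();
    }
    void close() {
        std::lock_guard<std::mutex> lock(m_);
        done_ = true;
        cv_.notify_all();
    }
    bool pop(std::vector<char>& out) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty() || done_; });
        if (q_.empty()) return false;  // closed and fully drained
        out = std::move(q_.front());
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::vector<char>> q_;
    bool done_ = false;
};
```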

Your other bottleneck may be fetching data from memory. Search the web for "data cache optimization c++". This is a micro-optimization unless you are fetching and processing huge amounts of data.
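The classic illustration of that effect is traversal order over a row-major matrix: both functions below compute the same sum, but the row-major walk touches memory sequentially while the column-major walk strides by `cols` elements and misses the cache on large matrices. A sketch:

```cpp
#include <cstddef>
#include <vector>

// Cache-friendly: inner loop walks consecutive elements of the row-major array.
long long sum_row_major(const std::vector<int>& m,
                        std::size_t rows, std::size_t cols) {
    long long s = 0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            s += m[r * cols + c];
    return s;
}

// Cache-hostile: inner loop strides by `cols` elements on every access.
long long sum_col_major(const std::vector<int>& m,
                        std::size_t rows, std::size_t cols) {
    long long s = 0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            s += m[r * cols + c];
    return s;
}
```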

Thomas Matthews