4

I am translating a program that performs numerical simulations from FORTRAN to C++.

I have to deal with big matrices of doubles, about 800 MB in size. This

double M[100][100][100][100];

gives a segmentation fault because the stack is not big enough. Using new and delete is awkward because I would need four nested for loops just to allocate my array, and the same again to deallocate it.

std::array lives on the stack, so it isn't suitable either. std::vector would be a nice choice, so:

First question: Is std::vector good for fast simulations, or would a nested

std::vector<std::vector<std::vector<std::vector<double>>>>

carry a lot of useless overhead?
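
For reference, building that nested variant would look roughly like this (just a sketch, not my actual code); every inner vector is a separate heap allocation with its own bookkeeping:

#include <vector>

using V1 = std::vector<double>;
using V2 = std::vector<V1>;
using V3 = std::vector<V2>;
using V4 = std::vector<V3>;

// 1 + 100 + 100*100 + 100*100*100 separately allocated vectors
V4 M(100, V3(100, V2(100, V1(100, 0.0))));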

Second question: Do you know of any other data structures I could use? Maybe there is something in Boost.

For the moment I am simply using this solution:

double * M = new double [100000000];

and I am computing the indices by hand to access the entries I need. If I don't find a more performant solution, I will write a class that wraps this last method and manages the indexing automatically (sketched below).

Third question: Do you think that would significantly decrease performance?
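
For concreteness, this is roughly the wrapper I have in mind (just a sketch with placeholder names, untested):

#include <cstddef>
#include <vector>

// 4-D view over one contiguous block, row-major: the last index is contiguous.
class Array4D {
public:
    Array4D(std::size_t n0, std::size_t n1, std::size_t n2, std::size_t n3)
        : n1_(n1), n2_(n2), n3_(n3), data_(n0 * n1 * n2 * n3) {}

    // inline accessor: computes the flat index on every access
    double& operator()(std::size_t i0, std::size_t i1,
                       std::size_t i2, std::size_t i3) {
        return data_[((i0 * n1_ + i1) * n2_ + i2) * n3_ + i3];
    }

private:
    std::size_t n1_, n2_, n3_;   // n0 is not needed for indexing
    std::vector<double> data_;   // single heap allocation, freed automatically
};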

Nisba
  • A single allocated block is a good choice for a matrix. –  Sep 17 '16 at 12:38
  • `std::vector M(100000000);` is probably best. – Pete Becker Sep 17 '16 at 12:39
  • Maybe lazy loading will help you. – Nikita Sep 17 '16 at 12:40
  • FYI: https://isocpp.org/files/papers/N3851.pdf –  Sep 17 '16 at 12:40
  • You should have a look at Boost Multidimensional Array Library: http://www.boost.org/doc/libs/1_61_0/libs/multi_array/doc/user.html – Alexey Guseynov Sep 17 '16 at 12:47
  • What OS/platform are you running on? The simplest method may be to just increase your available stack memory. – Andrew Henle Sep 17 '16 at 13:37
  • It is important to know that STL containers have a memory overhead (they cost more than just the stored elements), therefore nesting them (e.g. `vector<vector<...>>`) is often not a good idea. I'd recommend doing exactly what you did (a single block of memory). To improve code legibility, you may wrap that in a class and overload `operator()(int,int,int,int)` to retrieve the elements. – BrunoLevy Sep 17 '16 at 13:45
  • @AndrewHenle I am using macOS 10.12 and Fedora – Nisba Sep 17 '16 at 14:12
  • @BrunoLevy Is there any standard class that already does it? – Nisba Sep 17 '16 at 14:39
  • I don't know, but it is really not difficult to implement, 20 lines max. I can post an answer with it if you want (it's short but does not fit in a comment). – BrunoLevy Sep 17 '16 at 14:47
  • @BrunoLevy Should I define the overloading operator as an inline method? – Nisba Sep 18 '16 at 08:29
  • Yes (else you will pay a certain performance penalty). – BrunoLevy Sep 18 '16 at 15:53

5 Answers

1

You may want to consider std::valarray, which was designed to be competitive with FORTRAN. It stores its elements as a flat array and supports mathematical operations, as well as operations for slicing and indirect access.

That sounds like what you're planning anyway, although even the man page suggests there may be more flexible alternatives.
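
For illustration, a rough sketch of what that could look like (sizes shrunk here; nothing below is from the question's code):

#include <valarray>
#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t n = 10;                    // 100 in the real ~800 MB case
    std::valarray<double> M(0.0, n * n * n * n); // one flat block of doubles

    M += 1.0;                                    // whole-array math, no explicit loops

    // Cartesian access still needs a flat index (illustrative helper):
    auto idx = [n](std::size_t i, std::size_t j, std::size_t k, std::size_t l) {
        return ((i * n + j) * n + k) * n + l;
    };
    M[idx(1, 2, 3, 4)] = 42.0;

    // std::slice addresses a strided subset, e.g. all l for fixed i, j, k:
    std::valarray<double> line = M[std::slice(idx(1, 2, 3, 0), n, 1)];
    std::printf("%f\n", line[4]);                // prints 42.0
    return 0;
}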

John Q.
0

Using something on the stack is definitely much more efficient. The process memory (stack + heap) is limited by the OS, and your issue is that in most cases you might exceed the memory allocated to the process.

To work around the memory limitation, I would suggest you have a look at STXXL. It is a library which implements most of the STL containers and algorithms, but uses external memory (disk) when needed. Of course this will compromise performance...
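
A rough sketch of what that could look like with STXXL's vector (untested here; check the exact header and generator parameters against the STXXL documentation):

#include <stxxl/vector>

int main() {
    // stxxl::vector keeps only a block cache in RAM and pages the rest to
    // disk, so the full 100^4 doubles need not fit in memory at once.
    typedef stxxl::VECTOR_GENERATOR<double>::result vector_type;
    vector_type M;
    M.resize(100ULL * 100 * 100 * 100);

    M[12345] = 3.14;      // looks like normal vector access,
    double x = M[12345];  // but may trigger disk I/O behind the scenes
    (void)x;
    return 0;
}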

0

Programmers tend to approach every problem by first writing more code, which then has to be maintained. Not every problem is a nail...

More code is not the simplest, most effective solution here. More code is also likely to produce an executable that's slower.

Stack memory is just memory - it's no different from heap memory. It's just managed by the process differently, and is subject to different resource limits. There's no real difference to the OS whether a process uses 1 GB of memory on its stack, or 1 GB from its heap.

In this case, the stack size limit is likely an artificial configuration setting. On a Linux system, the stack size limit can be reset for a shell and its child processes:

bash-4.1$ ulimit -s unlimited
bash-4.1$ ulimit -s
unlimited
bash-4.1$

See this question and its answers for more details.

All POSIX-compliant systems should have similar features, as the stack-size limit is a POSIX-standard resource limit.
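
A process can also raise its own soft stack limit programmatically via the POSIX getrlimit()/setrlimit() interface. A minimal sketch (error handling kept brief; whether an already-running main thread actually benefits from the higher limit is platform-dependent, so it is often simplest to set the limit before launch):

#include <sys/resource.h>
#include <stdio.h>

int main( void )
{
    struct rlimit rl;

    if ( getrlimit( RLIMIT_STACK, &rl ) != 0 )
    {
        perror( "getrlimit" );
        return( 1 );
    }

    // raise the soft limit to the hard limit (which may be RLIM_INFINITY)
    rl.rlim_cur = rl.rlim_max;
    if ( setrlimit( RLIMIT_STACK, &rl ) != 0 )
    {
        perror( "setrlimit" );
        return( 1 );
    }

    printf( "stack soft limit is now %llu\n",
        ( unsigned long long ) rl.rlim_cur );
    return( 0 );
}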

Also, you can run a thread with an arbitrarily large stack quite easily:

#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <stdio.h>

void *threadFunc( void *arg )
{
    double array[1024][1024][64];
    memset( array, 0, sizeof( array ) );
    return( NULL );
}
int main( int argc, char **argv )
{
    // create and memset the stack lest the child thread start thrashing
    // the machine with "try to run/page fault/map page" cycles
    // the memset will create the physical page mappings
    size_t stackSize = strtoul( argv[ 1 ] ? argv[ 1 ] : "1073741824",
        NULL, 0 );    
    void *stack = mmap( 0, stackSize, PROT_READ | PROT_WRITE,
        MAP_PRIVATE | MAP_ANON, -1, 0 );
    memset( stack, 0, stackSize );

    // create a large stack for the child thread
    // (pthread_attr_setstack() supersedes the obsolete pthread_attr_setstackaddr())
    pthread_attr_t attr;
    pthread_attr_init( &attr );
    pthread_attr_setstack( &attr, stack, stackSize );

    pthread_t tid;
    pthread_create( &tid, &attr, threadFunc, NULL );
    void *result;
    pthread_join( tid, &result );
    return( 0 );
}

Error checking has been omitted.

This also works if you run ulimit -s unlimited before running the compiled program (and of course if the machine has enough virtual memory...):

#include <string.h>

int main( int argc, char **argv )
{
    double array[1024][1024][64];
    memset( array, 0, sizeof( array ) );

    return( 0 );
}
Andrew Henle
  • *"Stack memory is just memory - it's no different from heap memory."* That's absolutely not true. How the compiler (and so the compiled program) handles those two type of memory is very different. Heap requires system calls, some checks, and are more susceptible to caches miss. Just to say a couple of things – BiagioF Sep 17 '16 at 14:41
  • In terms of efficiency is it the same as memory management with new and delete? – Nisba Sep 17 '16 at 14:47
  • @BiagioFesta - Once the process gets the memory, it's just memory. *How* the process gets the memory from the OS isn't likely to impact performance much for long-running and/or computationally-intensive processes. How the memory is managed internally by the process can impact performance greatly however. – Andrew Henle Sep 17 '16 at 16:17
  • @LucaMarconato *In terms of efficiency is it the same as memory management with new and delete?* Stack memory can be *much* faster than `new`/`delete` - since the stack is tied to the thread running on it, access is entirely contention- and lock-free. One restriction is that local variables on the stack no longer exist once the function they're declared in returns (that's the "easy" explanation...). Another restriction for multithreaded processes is that each thread needs its own dedicated stack, so stack memory usage can be less space-efficient than using a single process-wide heap. – Andrew Henle Sep 17 '16 at 16:22
0

In some cases, it might be useful to cast the 1-D pointer to a 4-D rectangular array to enable Cartesian indexing (rather than linear indexing):

#include <cstdio>

#define For( i, n ) for( int i = 0; i < n; i++ )

double getsum( double *A, int *n, int loop )
{
    // Cast to a 4-D array via a variable-length-array typedef (GNU extension, not standard C++).
    typedef double (* A4d_t)[ n[2] ][ n[1] ][ n[0] ];
    A4d_t A4d = (A4d_t) A;

    // Fill the array with linear indexing.
    int ntot = n[0] * n[1] * n[2] * n[3];
    For( k, ntot ) A[ k ] = 1.0 / (loop + k + 2);

    // Calc weighted sum with Cartesian indexing.
    double s = 0.0;
    For( i3, n[3] )
    For( i2, n[2] )
    For( i1, n[1] )
    For( i0, n[0] )
        s += A4d[ i3 ][ i2 ][ i1 ][ i0 ] * (i0 + i1 + i2 + i3 + 4);

    return s;
}

int main()
{
    int n[ 4 ] = { 100, 100, 100, 100 };

    double *A = new double [ n[0] * n[1] * n[2] * n[3] ];

    double ans = 0.0;
    For( loop, 10 )
    {
        printf( "loop = %d\n", loop );
        ans += getsum( A, n, loop );
    }
    printf( "ans = %30.20f\n", ans );
    delete [] A;
    return 0;
}

which takes 5.7 sec with g++ 6.0 -O3 on Mac OS X 10.9. It might be interesting to compare the performance with a version based on vector<vector<...>> or on a custom array-view class. (I tried the latter before, and at that time the above array casting was somewhat faster than my naive array class.)

roygvib
-1

I think the answer to this question is quite opinion-based. The notion of "good" depends strictly on how the data structure is used.

Anyway, if the number of elements does not change at run time, and your problem is essentially one of memory access, then the best solution, in my opinion, is a single contiguous block of memory.

Generally, in those cases, my choice is a simple `T* data = new T[SIZE];` encapsulated in a class which handles the access correctly.

Using a raw pointer makes me feel a little more in control of the memory, but in practice a std::vector<T> is essentially the same thing.


That's all I can say from the information you've provided in your question. What I can additionally suggest is that you also pay attention to how the data is used.

For example: in order to maximize your application's performance, try to exploit the caches and avoid cache misses. You could try to find out whether your data has some natural "access patterns", or even whether your problem can scale in a multi-threaded context.
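
As a hypothetical illustration (not the question's code): with a flat row-major layout, keeping the last index in the innermost loop touches memory sequentially, while swapping the loop order produces large strides and many more cache misses.

#include <vector>
#include <cstddef>

// doubles every element; the innermost loop walks contiguous memory
void scale( std::vector<double>& M, std::size_t n )
{
    for ( std::size_t i = 0; i < n; ++i )
        for ( std::size_t j = 0; j < n; ++j )
            for ( std::size_t k = 0; k < n; ++k )
                for ( std::size_t l = 0; l < n; ++l )   // contiguous accesses
                    M[ ( ( i * n + j ) * n + k ) * n + l ] *= 2.0;
}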


To conclude: in my opinion, a contiguous vector of doubles is generally the best choice, which answers your question. But if you care about performance, you should also think about how best to exploit the caches and the processor's mechanisms (such as multi-threading).

BiagioF
  • So you had code that worked in Fortran, and you want to rewrite it in C and it will not run... But for what purpose? You could link the working Fortran with the C... would that provide any benefit to you? – Holmz Sep 18 '16 at 12:25