Stack overflow on std::array class

Question

I used to do some C# and Java, but lately I wanted to learn c++ as well as I found out many large company prefer c++ for its efficiency.

Trying to get used to the syntax and behaviour of C++, I started to translate the test I done in Java to C++ code, and the following is one of them:

#include "stdafx.h"
#include <iostream>
#include <array>
#include <string>
#include <cstring>
#include <chrono>
using namespace std;
using namespace std::chrono;

typedef array<array<int, 1000>, 1000> aArray;

aArray hasLocality(aArray, aArray);
aArray noLocality(aArray, aArray);

static aArray a = aArray();
static aArray b = aArray();

int main() {

    for (size_t i = 0; i < 100; i++)
    {
        for (size_t j = 0; j < 100; j++)
        {
            a[i][j] = i + j;
            b[i][j] = i + j;
        }
    }
    hasLocality(a, b);
    noLocality(a, b);
    system("pause");
    return 0;
}

aArray hasLocality(aArray a, aArray b) {
    milliseconds startTime = duration_cast<milliseconds>(
        system_clock::now().time_since_epoch()
        );
    aArray ans = aArray();
    for (size_t i = 0; i < ans.size(); i++)
    {
        for (size_t k = 0; k < ans[0].size(); k++)
        {
            for (size_t j = 0; j < ans[0].size(); j++)
            {
                ans[i][j] = ans[i][j] + a[i][k] * b[k][j];
            }
        }
    }

    milliseconds endTime = duration_cast<milliseconds>(
        system_clock::now().time_since_epoch()
        );
    string time = std::to_string((endTime - startTime).count()) + "\n";
    cout.write(time.c_str(), (unsigned)strlen(time.c_str()));
    return ans;
}

aArray noLocality(aArray a, aArray b) {
    milliseconds startTime = duration_cast<milliseconds>(
        system_clock::now().time_since_epoch()
        );
    aArray ans = aArray();
    for (size_t i = 0; i < ans.size(); i++)
    {
        for (size_t j = 0; j < ans[0].size(); j++)
        {
            for (size_t k = 0; k < ans[0].size(); k++)
            {
                ans[i][j] = ans[i][j] + a[i][k] * b[k][j];
            }
        }
    }
    milliseconds endTime = duration_cast<milliseconds>(
        system_clock::now().time_since_epoch()
        );
    string time = std::to_string((endTime - startTime).count()) + "\n";
    cout.write(time.c_str(), (unsigned)strlen(time.c_str()));
    return ans;
}

This is one of my test on locality by doing simple matrix multiplication, however I can't get rid of stack overflow exception because of the excessive large array size, which I think is essential for the test.

I also thought the array would be placed on heap instead of stack as I put it as static.

At last, I found that the noLocality is more efficient than hasLocality when given a smaller array size (100, 100), was that abnormal or just insufficient amount of data for locality to take place?

Thanks in advance

You are passing it by value, so a copy is made on the stack. — Raymond Chen, Jan 29 '18 at 03:12
@NathanOliver: The problem here is that a vector of vectors has very little locality of reference. You'd have 1000 chunks of 1000 integers. A modern computer has no problem with one chunk holding a million integers, just as long as it's not on the stack. — MSalters, Jan 29 '18 at 15:41
@mslaters yes, nested vectors would not be good but you can store the 2d structure flatly and maintain the locality. I would definitely hide that with an abstraction though. — NathanOliver, Jan 29 '18 at 18:48
@NathanOliver I did considered using a vector(which might be the right way), but I don't want to "avoid" this error without knowing what is causing it. — Noobnewbier, Jan 29 '18 at 19:02

eerorika · Answer 1 · 2018-01-29T19:08:14.170

The stack size is limited (usually about one to few megabytes on desktop systems). Big objects, such as instances of array<array<int, 1000>, 1000> can easily overflow this limit. Allocate big objects dynamically or statically to avoid the overflow.

I also thought the array would be placed on heap instead of stack as I put it as static.

ans is not static. It will be placed on the stack.

At last, I found that the noLocality is more efficient than hasLocality when given a smaller array size (100, 100), was that abnormal or just insufficient amount of data for locality to take place?

The array is quite likely too small to show effects of cache locality because 100*100 int array is small enough to fit within a L1 cache entirely (assuming Pentium Dual-Core or later). Thus the order in which it's read makes no difference.

I think the OP was talking about "excessive large array size", not about stack allocation, when saying "which I think is essential for the test". — Ove, Jan 29 '18 at 03:24
@Ove you're right I was talking about excessive large array size ;D — Noobnewbier, Jan 29 '18 at 19:06

score 4 · Accepted Answer · answered Jan 29 '18 at 02:56

4

In Java, all object arguments are passed by reference. In C++, the default is that they are passed by value (ie a copy is placed on the stack). So when you call:

aArray hasLocality(aArray a, aArray b)

You end up with a copy of a on the stack followed by a copy of b, and you also have space allocated for the return value, another copy of an aArray. You have 3 massive arrays allocated on the stack. This is different from Java.

In C++ you can avoid passing by value by using references or pointers. Java does not have pointers. Java references are not quite the same as C++ references, but there are similarities.

So if you have:

aArray hasLocality(aArray &a, aArray &b)

then you end up with something similar to Java, the arrays are passed by reference, same as in Java. Calls to hasLocality are no different. So just change hasLocality and noLocality in this way. You still have the return value copy. To avoid that, one thing you can do is pass in the return value as well:

void hasLocality(aArray &a, aArray &b, aArray &ans)

and then move

aArray ans = aArray();

outside the function.

At that point you'd have no array copying going on, just like Java. But do keep in mind references in C++ are a bit different, once a reference refers to an object, it cannot refer to any other object. Since you're new to C++ this might confuse you right now, but you'll learn. Note that C++ is more complicated than Java overall.

answered Jan 29 '18 at 02:56

Sean F

4,344
16
30

2

void method returning value via output reference (`aArray &ans`) is bad habit. Definitely not what I would recommend to a newcomer. – Jaa-c Jan 29 '18 at 09:12
@Jaa-c Yes, but I am not here to teach all of C++, returning a pointer or a reference requires knowledge of memory management that is alien to Java programmers, while returning a copy is what we are trying to avoid here. So this was the easiest suggestion to solve his problem for now while he learns the language. You do not help beginners by overloading them with too much info at one time. – Sean F Jan 29 '18 at 15:30
Thanks for the easy approach from @SeanF, but if possible could you also give me a slight hint of doing it the right way? – Noobnewbier Jan 29 '18 at 19:05
There is more than one right way, but the basic idea is that you would generally want to return the result rather than pass it in as an "out" parameter which I suggested https://stackoverflow.com/questions/6900035/in-out-parameters-and-how-to-work-with-them-in-c. The problem with using a return value is that you have to manage the memory for it. This is different from Java. – Sean F Jan 29 '18 at 20:23
If you allocate it on the stack inside the function then you must copy it, but in your case we are trying to avoid the copy. If you allocate on the heap, you can return a pointer to it, but the caller must know if and when to deallocate it: https://stackoverflow.com/questions/3145799/how-to-delete-a-pointer-after-returning-its-value-inside-a-function You could also use a field or a global that does not need to be allocated or deallocated, but that is also not good practice in general (although exceptions exist). – Sean F Jan 29 '18 at 20:28
Or, there are some nifty classes like the ones mentioned in another answer: std::shared_ptr or std::unique_ptr, these handle the deallocation for you. You can try them all and you can read about the reasons why some are considered better than others. In Java, there are no such issues, an object is always passed by reference and deallocation is always handled by GC. – Sean F Jan 29 '18 at 20:29
One other option is to use certain data structures which are copied, but the underlying data is not copied (std::vector is useful for this). There are many options and lots of debate about what is good and what is bad practice. Some of the underlying issues are that you want to ensure memory is never leaked, you want your code to remain modular and to avoid side-effects (a drawback of using global or static memory), you want to avoid excessive copying (your code shows how this can be bad), and you want your code to be readable and maintainable (a drawback of using in/out parameters). – Sean F Jan 29 '18 at 20:42
One of the reasons Java was invented was to eliminate some of the difficulties and issues with managing memory in a language like C++. Many newer languages (java, js, go, python, etc) handle memory allocation and deallocation for you so that it is harder for you to screw it up. – Sean F Jan 29 '18 at 20:49

Francis Cugler · Answer 3 · 2018-01-29T10:46:00.667

After reading your question and looking at your attempt of c++ code implementations; I have re-written your code slightly for several reasons that will be listed through out this answer. Before I show you what I have done let's begin with what you have mentioned and then we will go from there:

I also thought the array would be placed on heap instead of stack as I put it as static.

In c++ static storage is not the same as the heap. In order to use the heap you have to use the key words new & delete or their array versions new[] & delete[] respectively however in modern c++ they are both frowned upon unless if they are absolutely necessary & you know exactly what you are doing and how to use them properly, otherwise it is wiser to use std::shared_ptr<T> or std::unique_ptr<T> to make your life and the user's of your code lives' better as these are called smart pointers. They manage the allocation & de-allocation of the heap in a clean and safe manner even though they are not 100% bullet proof they are fairly easy and great to use and are preferred over the other methods of memory management. There is one other way to access the heap in C++ and that is through a C standard library as these functions are malloc() & free() from the header <cstdlib>, but they are ill advised against and for good reasons too. See this answer here: stack: The c++ Memory Model.

I made a few changes to your code by:

Adding a simple class for doing time execution profiling as it makes life easier.

The ExcecutionTimer class is a header only class and this class makes the rest of your code easier to read by removing duplicate code.

Removing the using "namespace" from the global scope.

It will save you headaches in the future.

I changed your function(s)' declarations-definitions.

Appropriate Containers, Easier to Use, A Little More Efficient, Maintainability, Reusability, And Readability.

Finally I used std::vector<T> instead of std::array<size, T>

For similar reasons mentioned above

Okay so now onto my take of your modified code:

ExecutionTimer.h

#ifndef EXECUTION_TIMER_H
#define EXECUTION_TIMER_H

#include <chrono>
#include <type_traits>
#include <sstream>
#include <iostream>

template<class Resolution = std::chrono::milliseconds>
class ExecutionTimer {
public:
    using Clock = std::conditional_t<std::chrono::high_resolution_clock::is_steady,
                                     std::chrono::high_resolution_clock,
                                     std::chrono::steady_clock>;
private:
    const Clock::time_point mStart = Clock::now();

public:
    ExecutionTimer() = default;
    ~ExecutionTimer() {
        const auto end = Clock::now();
        std::ostringstream strStream;
        strStream << "Destructor Elapsed: "
            << std::chrono::duration_cast<Resolution>(end - mStart ).count()
            << std::endl;
        std::cout << strStream.str() << std::endl;
    }

    inline void stop() {
        const auto end = Clock::now();
        std::ostringstream strStream;
        strStream << "Stop Elapsed: "
            << std::chrono::duration_cast<Resolution>(end - mStart).count()
            << std::endl;
        std::cout << strStream.str() << std::endl;
    }
};

#endif // !EXECUTION_TIMER_H

main.cpp

#include <vector>
#include <iostream>

#include "ExecutionTimer.h"

// Don't Like Magic Numbers So I Set A Constant Unsigned For The Vectors Size
const unsigned VEC_SIZE = 200;

// Using std::vector<T> instead of std::array<val, T>
// However 2 typedefs are needed to set both sizes. 
typedef std::vector<int> Ints;
typedef std::vector<Ints> VecInts;

// Passing by const reference for both input vectors. 
// Moved the return to a reference param and defaulted it. 
// Changed the functions return types to void. 
// Finally, I added a defaulted size parameter.
void hasLocality( const VecInts& A, const VecInts& B, 
                  VecInts& Ans = VecInts(VEC_SIZE, Ints(VEC_SIZE)), 
                  const unsigned size = VEC_SIZE );
void noLocality( const VecInts& B, const VecInts& B, 
                 VecInts& Ans = VecInts(VEC_SIZE, Ints(VEC_SIZE)), 
                 const unsigned size = VEC_SIZE );

int main() {
    try {
        // Create vectors a & b that have 1000 vectors of 1000 ints.
        static VecInts A( VEC_SIZE, Ints( VEC_SIZE ) );
        static VecInts B( VEC_SIZE, Ints( VEC_SIZE ) );

        // If you need the values returned by the functions make local copy
        // here as they will be passed by reference.
        VecInts AnsHasLoc( VEC_SIZE, Ints( VEC_SIZE ) );
        VecInts AnsNoLoc( VEC_SIZE, Ints( VEC_SIZE ) );

        // changed `std::size_t to just unsigned as you are only doing array indexing on these vectors
        for ( unsigned i = 0; i < 100; i++ ) {
             for ( unsigned j = 0; j < 100; j++ ) {
                 A[i][j] = i + j;
                 B[i][j] = i + j;
             }
         }

         // Last 2 parameters are defaulted and omitted. 
         // The second to last is a return value by reference.
         // The third is the internal size of the vectors
         hasLocality( A, B ); 
         noLocality( A, B ); 

         // Same as above but passing in the return values by reference,
         // still leaving the size defaulted.
         hasLocality( A, B AnsHasLoc );
         noLocality( A, B, AnsNoLoc );

    } catch ( std::exception e ) {
        std::cout << e.what() << std::endl;

        std::cout << "\nPress any key and enter to quit." << std::endl;
        char q;
        std::cin >> q;
        return -1;
    }

    std::cout << "\nPress any key and enter to quit." << std::endl;
    char q;
    std::cin >> q;
    return 0;      
}

void hasLocality( const VecInts& A, const VecInts& B, 
                     VecInts& Ans, const unsigned size ) {
    ExecutionTimer<> timer; // default template param = milliseconds

    // No need to declare local stack temp copy for return value.
    // Return it by reference from parameter list.
    // VecInts Ans( size, Ints( size ) ); 
    for ( unsigned i = 0; i < Ans.size(); i++ ) {
        for ( unsigned k = 0; k < Ans[0].size(); k++ ) {
            for ( unsigned j = 0; j < Ans[0].size(); j++ ) {
                Ans[i][j] = Ans[i][j] + A[i][k] * B[k][j];
            }
        }
    }

    timer.stop();
    // return Ans; // No need to return local stack copy.
}

void noLocality( const VecInts& A, const VecInts& B, 
                    VecInt& Ans, const unsigned size ) {
    ExecutionTimer<> timer; // default in milliseconds

    // No need to declare local stack temp copy for return value; 
    // Return it by reference from parameter list.
    // VecInts Ans( size, Ints( size ) );
    for ( unsigned i = 0; i < Ans.size(); i++ ) {
        for ( unsigned j = 0; j < Ans[0].size(); j++ ) {
            for ( unsigned k = 0; k < Ans[0].size(); k++ ) {
                Ans[i][j] = Ans[i][j] + A[i][k] * B[k][j];
            }
        }
    }

    timer.stop();
    // return Ans; // No need to return local stack copy
}

Console Output

Possible output when I set VEC_SIZE = 200

Stop Elapsed: 22733

Destructor Elapsed: 22734

Stop Elapsed: 22499

Destructor Elapsed: 22500

Press any key and enter to quit.

Possible output when I set VEC_SIZE = 100

Stop Elapsed: 2909

Destructor Elapsed: 2910

Stop Elapsed: 2815

Destructor Elapsed: 2816

Press any key and enter to quit.

Note: - I'm running this on Windows 7 Home Premium - 64bit with 8GB Ram on an Intel Quad Core Extreme 3Ghz. I am using Visual Studio 2017 CE with the standard c++17 enabled. I'm running this in debug x64 mode, with all basic default compiler optimizations.

Now when I set VEC_SIZE = 500 or VEC_SIZE = 1000 I do not get any crashes but the time execution does blow up and I would have to wait for 5 - 10 minutes for it to finish execution.

Take into consideration that these vectors are of static storage and are not on the heap.

If you're wanting to use the heap you can use std::shard_ptr<T> for shared resources or std::unique_ptr<T> for an owned resource.

For example the first time usage with VEC_SIZE = 200 ran for about 22500ms but going from 200 up to 250 alone the time elapsed output in my console nearly doubled at about 43600ms & 44000ms for the two functions. 500² & 1000² elements and the time execution blows up. Just something you need to be aware of.

Now since these vector of vectors stores int where a typical modern int today is normally 4 bytes (not guaranteed - OS/Architecture/Compiler dependent) and we know that 1000 x 1000 = 1000000 at 4 Bytes each; this will cause that vector to be about 4 million bytes or 4MB in size. So this is close to exceeding the limit of the stack and or page file. Anything larger than this I would high recommend either:

Use the heap where it is suitable.

Splitting up this container into smaller buffers or partitions

Do a batch & queue-dequeue type processing

Use multi-threading and or parallel programming.

Template Meta Programming

Or any combination of the above.

An example of using the heap is basically the same exact code above except for the use of smart pointers.

#include <vector>
#include <memory>
#include <iostream>

const unsigned VEC_SIZE = 250;

typedef std::vector<int> Ints;
typedef std::vector<Ints> VecInts;

// functions are the same & have not changed.

int main() {

    try {
        // Uncomment the key word to make these shared pointers static.
        /*static*/ std::shared_ptr<VecInts> A =
            std::make_shared<VecInts>( VecInts( VEC_SIZE, Ints( VEC_SIZE ) ) ); 
        /*static*/ std::shared_ptr<VecInts> B =
            std::make_shared<VecInts>( VecInts( VEC_SIZE, Ints( VEC_SIZE ) ) );

        // The shared pointers above are only of the outer container itself
        // Another possible way would be to make the internal vectors hold `shared_ptrs`
        // this would involve having to change the typedefs above.   


        // If needing the values returned by functions create local copy here
        // as they will be passed by reference
        VecInts AnsHasLoc( VEC_SIZE, Ints( VEC_SIZE ) );
        VecInts AnsNoLoc( VEC_SIZE, Ints( VEC_SIZE ) );

        // Now To Use The Smart Pointers 
        for ( unsigned i = 0; i < VEC_SIZE; i++ ) {
            for ( unsigned j = 0; j < VEC_SIZE; j++ ) {
                // Need To Dereference Them
                (*A)[i][j] = i + j;
                (*B)[i][j] = i + j;
            }
        }

        // Need To Dereference Them & Using the local copies passed by reference
        hasLocality( *A, *B, AnsHasLoc  );
        noLocality( *A, *B, AnsNoLoc );

    } catch ( ... )  {
        // message
    }
    // message
    return 0;
 }

If using a vector of shared_ptrs instead of a shared_ptr of a vectors the typedefs might look like this:

typedef std::vector<std::shared_ptr<int>> IntsPtr;
typedef std::vector<std::shared_ptr<Ints>> VecIntsPtr;

And the code to set the values is not all that difficult once you get use to using smart pointers. If you have c++11, c++14 or c++17 available to you, you can even use the newer features of std::vector and other standard containers with the new std::move() or std::forward<T>() semantics by using std::vector's emplace_back() function instead of using push_back(). It will invoke any objects constructor using perfect forwarding, and this method works well with smart pointers, but care needs to be taken as it can cause more problems if not done correctly compared to the well known push_back() that is safer so to speak. You can read this Stack Q/A: emplace_back() and perfect forwarding.

Now when I ran the code above with the smart pointers in both cases with the smart pointers being declared as local (stack), or with static storage... I did get about the same results with the internal size of the vectors being 250 x 250 in size at about 43753ms & 44604ms for both functions with static shared_ptr<> & 43936ms & 44008ms for local-stack shared_ptr<>.

A small optimization can be done within your two functions inside the triple for loops. You can change these two lines in your triple for loops respectively:

// Ans[i][j] = Ans[i][j] + A[i][k] * B[k][j]; // To
Ans[i][j] += A[i][k] * B[k][j];

When I did this: with VEC_SIZE = 250 the ExecutionTimer reported about 34679ms & 34979ms with the non static shared_ptr<> version. It shaved off about 10000ms of time just by using the += operator as opposed to expanding it out.

Now as for your other question:

At last, I found that the noLocality is more efficient than hasLocality when given a smaller array size (100, 100), was that abnormal or just insufficient amount of data for locality to take place?

When doing bench mark testing it varies on your code, your compiler & linker configuration(s), your debugger configuration(s), optimization(s), and even your machine's running OS with all of its back ground processes that are currently active in ram. I ran a few tests and I was getting vary similar results each time for the different sizes of the vectors. I did not notice a large discrepancy in ExecutionTime between the two methods, but the one does seem to run maybe a few seconds slower than the other when there are more than 200² elements.

As for the testing of your methods execution time I modeled them above as you had them except replacing them with a class of mine. Now personally what I would do for this kind of test is I would move the ExceutionTimer out of these functions so that the functions do their own task, then where ever you are calling these functions wrap a timer object around it. Your functions definitions would look like this:

void hasLocality( const VecInts& A, const VecInts& B, VecInts& Ans, const unsigned size ) {
    for ( unsigned i = 0; i < Ans.size(); i++ ) {
        for ( unsigned k = 0; k < Ans[0].size(); k++ ) {
            for ( unsigned j = 0; j < Ans[0].size(); j++ ) {
                Ans[i][j] += A[i][k] * B[k][j];             
            }
        }
    }
}

void noLocality( const VecInts& A, const VecInts& B, VecInts& Ans, unsigned size ) {
    for ( unsigned i = 0; i < Ans.size(); i++ ) {
        for ( unsigned j = 0; j < Ans[0].size(); j++ ) {
            for ( unsigned k = 0; k < Ans[0].size(); k++ ) {
                // Ans[i][j] = Ans[i][j] + A[i][k] * B[k][j];
                Ans[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}

And to bench mark test them would be like this:

ExecutionTimer<> timerA;
hasLocality( A, B, AnsHasLoc );
timerA.stop();

ExecutionTimer<> timerB;
noLocality( A, B, AnsNoLoc );
timerB.stop();

This way your internal function definitions or implementation doesn't have embedded unrelated dependent code in its body. The code is more maintainable, reusable, cleaner and has better readability.

Feel free to test the code above, play around with it to see what some of the differences are and welcome to modern c++!

I honestly think your answer is equally good as the accepted answer, but for any future reader coming across this I would definitely recommend this. — Noobnewbier, Jan 29 '18 at 19:18

Stack overflow on std::array class

3 Answers3

Linked

Related