After reading your question and looking at your attempt of c++ code implementations; I have re-written your code slightly for several reasons that will be listed through out this answer. Before I show you what I have done let's begin with what you have mentioned and then we will go from there:
I also thought the array would be placed on heap instead of stack as I put it as static.
In c++
static
storage is not the same as the heap
. In order to use the heap you have to use the key words new
& delete
or their array versions new[]
& delete[]
respectively however in modern c++
they are both frowned upon unless if they are absolutely necessary & you know exactly what you are doing and how to use them properly, otherwise it is wiser to use std::shared_ptr<T>
or std::unique_ptr<T>
to make your life and the user's of your code lives' better as these are called smart pointers
. They manage the allocation & de-allocation of the heap in a clean and safe manner even though they are not 100%
bullet proof they are fairly easy and great to use and are preferred over the other methods of memory management. There is one other way to access the heap in C++
and that is through a C standard library
as these functions are malloc()
& free()
from the header <cstdlib>
, but they are ill advised against and for good reasons too. See this answer here: stack: The c++ Memory Model.
I made a few changes to your code by:
- Adding a simple class for doing
time execution profiling
as it makes life easier.
- The
ExcecutionTimer
class
is a header only class
and this class
makes the rest of your code easier to read by removing duplicate code.
- Removing the
using "namespace"
from the global scope
.
- It will save you headaches in the future.
- I changed your
function(s)'
declarations-definitions
.
- Appropriate Containers, Easier to Use, A Little More Efficient, Maintainability, Reusability, And Readability.
- Finally I used
std::vector<T>
instead of std::array<size, T>
- For similar reasons mentioned above
Okay so now onto my take of your modified code:
ExecutionTimer.h
#ifndef EXECUTION_TIMER_H
#define EXECUTION_TIMER_H
#include <chrono>
#include <type_traits>
#include <sstream>
#include <iostream>
template<class Resolution = std::chrono::milliseconds>
class ExecutionTimer {
public:
using Clock = std::conditional_t<std::chrono::high_resolution_clock::is_steady,
std::chrono::high_resolution_clock,
std::chrono::steady_clock>;
private:
const Clock::time_point mStart = Clock::now();
public:
ExecutionTimer() = default;
~ExecutionTimer() {
const auto end = Clock::now();
std::ostringstream strStream;
strStream << "Destructor Elapsed: "
<< std::chrono::duration_cast<Resolution>(end - mStart ).count()
<< std::endl;
std::cout << strStream.str() << std::endl;
}
inline void stop() {
const auto end = Clock::now();
std::ostringstream strStream;
strStream << "Stop Elapsed: "
<< std::chrono::duration_cast<Resolution>(end - mStart).count()
<< std::endl;
std::cout << strStream.str() << std::endl;
}
};
#endif // !EXECUTION_TIMER_H
main.cpp
#include <vector>
#include <iostream>
#include "ExecutionTimer.h"
// Don't Like Magic Numbers So I Set A Constant Unsigned For The Vectors Size
const unsigned VEC_SIZE = 200;
// Using std::vector<T> instead of std::array<val, T>
// However 2 typedefs are needed to set both sizes.
typedef std::vector<int> Ints;
typedef std::vector<Ints> VecInts;
// Passing by const reference for both input vectors.
// Moved the return to a reference param and defaulted it.
// Changed the functions return types to void.
// Finally, I added a defaulted size parameter.
void hasLocality( const VecInts& A, const VecInts& B,
VecInts& Ans = VecInts(VEC_SIZE, Ints(VEC_SIZE)),
const unsigned size = VEC_SIZE );
void noLocality( const VecInts& B, const VecInts& B,
VecInts& Ans = VecInts(VEC_SIZE, Ints(VEC_SIZE)),
const unsigned size = VEC_SIZE );
int main() {
try {
// Create vectors a & b that have 1000 vectors of 1000 ints.
static VecInts A( VEC_SIZE, Ints( VEC_SIZE ) );
static VecInts B( VEC_SIZE, Ints( VEC_SIZE ) );
// If you need the values returned by the functions make local copy
// here as they will be passed by reference.
VecInts AnsHasLoc( VEC_SIZE, Ints( VEC_SIZE ) );
VecInts AnsNoLoc( VEC_SIZE, Ints( VEC_SIZE ) );
// changed `std::size_t to just unsigned as you are only doing array indexing on these vectors
for ( unsigned i = 0; i < 100; i++ ) {
for ( unsigned j = 0; j < 100; j++ ) {
A[i][j] = i + j;
B[i][j] = i + j;
}
}
// Last 2 parameters are defaulted and omitted.
// The second to last is a return value by reference.
// The third is the internal size of the vectors
hasLocality( A, B );
noLocality( A, B );
// Same as above but passing in the return values by reference,
// still leaving the size defaulted.
hasLocality( A, B AnsHasLoc );
noLocality( A, B, AnsNoLoc );
} catch ( std::exception e ) {
std::cout << e.what() << std::endl;
std::cout << "\nPress any key and enter to quit." << std::endl;
char q;
std::cin >> q;
return -1;
}
std::cout << "\nPress any key and enter to quit." << std::endl;
char q;
std::cin >> q;
return 0;
}
void hasLocality( const VecInts& A, const VecInts& B,
VecInts& Ans, const unsigned size ) {
ExecutionTimer<> timer; // default template param = milliseconds
// No need to declare local stack temp copy for return value.
// Return it by reference from parameter list.
// VecInts Ans( size, Ints( size ) );
for ( unsigned i = 0; i < Ans.size(); i++ ) {
for ( unsigned k = 0; k < Ans[0].size(); k++ ) {
for ( unsigned j = 0; j < Ans[0].size(); j++ ) {
Ans[i][j] = Ans[i][j] + A[i][k] * B[k][j];
}
}
}
timer.stop();
// return Ans; // No need to return local stack copy.
}
void noLocality( const VecInts& A, const VecInts& B,
VecInt& Ans, const unsigned size ) {
ExecutionTimer<> timer; // default in milliseconds
// No need to declare local stack temp copy for return value;
// Return it by reference from parameter list.
// VecInts Ans( size, Ints( size ) );
for ( unsigned i = 0; i < Ans.size(); i++ ) {
for ( unsigned j = 0; j < Ans[0].size(); j++ ) {
for ( unsigned k = 0; k < Ans[0].size(); k++ ) {
Ans[i][j] = Ans[i][j] + A[i][k] * B[k][j];
}
}
}
timer.stop();
// return Ans; // No need to return local stack copy
}
Console Output
Possible output when I set VEC_SIZE = 200
Stop Elapsed: 22733
Destructor Elapsed: 22734
Stop Elapsed: 22499
Destructor Elapsed: 22500
Press any key and enter to quit.
Possible output when I set VEC_SIZE = 100
Stop Elapsed: 2909
Destructor Elapsed: 2910
Stop Elapsed: 2815
Destructor Elapsed: 2816
Press any key and enter to quit.
Note: - I'm running this on Windows 7 Home Premium - 64bit
with 8GB Ram
on an Intel Quad Core Extreme 3Ghz
. I am using Visual Studio 2017 CE
with the standard c++17
enabled. I'm running this in debug x64 mode
, with all basic default compiler optimizations.
Now when I set VEC_SIZE = 500
or VEC_SIZE = 1000
I do not get any crashes but the time execution does blow up and I would have to wait for 5 - 10 minutes
for it to finish execution.
Take into consideration that these vectors are of static storage and are not on the heap.
If you're wanting to use the heap you can use std::shard_ptr<T>
for shared resources or std::unique_ptr<T>
for an owned resource.
For example the first time usage with VEC_SIZE = 200
ran for about 22500ms
but going from 200
up to 250
alone the time elapsed output in my console nearly doubled at about 43600ms
& 44000ms
for the two functions. 500
2 & 1000
2 elements and the time execution blows up. Just something you need to be aware of.
Now since these vector of vectors stores int
where a typical modern int
today is normally 4 bytes
(not guaranteed - OS/Architecture/Compiler dependent) and we know that 1000 x 1000 = 1000000
at 4 Bytes
each; this will cause that vector to be about 4 million bytes
or 4MB
in size. So this is close to exceeding the limit of the stack and or page file. Anything larger than this I would high recommend either:
- Use the heap where it is suitable.
- Splitting up this container into smaller buffers or partitions
- Do a batch & queue-dequeue type processing
- Use multi-threading and or parallel programming.
- Template Meta Programming
- Or any combination of the above.
An example of using the heap is basically the same exact code above except for the use of smart pointers.
#include <vector>
#include <memory>
#include <iostream>
const unsigned VEC_SIZE = 250;
typedef std::vector<int> Ints;
typedef std::vector<Ints> VecInts;
// functions are the same & have not changed.
int main() {
try {
// Uncomment the key word to make these shared pointers static.
/*static*/ std::shared_ptr<VecInts> A =
std::make_shared<VecInts>( VecInts( VEC_SIZE, Ints( VEC_SIZE ) ) );
/*static*/ std::shared_ptr<VecInts> B =
std::make_shared<VecInts>( VecInts( VEC_SIZE, Ints( VEC_SIZE ) ) );
// The shared pointers above are only of the outer container itself
// Another possible way would be to make the internal vectors hold `shared_ptrs`
// this would involve having to change the typedefs above.
// If needing the values returned by functions create local copy here
// as they will be passed by reference
VecInts AnsHasLoc( VEC_SIZE, Ints( VEC_SIZE ) );
VecInts AnsNoLoc( VEC_SIZE, Ints( VEC_SIZE ) );
// Now To Use The Smart Pointers
for ( unsigned i = 0; i < VEC_SIZE; i++ ) {
for ( unsigned j = 0; j < VEC_SIZE; j++ ) {
// Need To Dereference Them
(*A)[i][j] = i + j;
(*B)[i][j] = i + j;
}
}
// Need To Dereference Them & Using the local copies passed by reference
hasLocality( *A, *B, AnsHasLoc );
noLocality( *A, *B, AnsNoLoc );
} catch ( ... ) {
// message
}
// message
return 0;
}
If using a vector
of shared_ptr
s instead of a shared_ptr
of a vector
s the typedef
s might look like this:
typedef std::vector<std::shared_ptr<int>> IntsPtr;
typedef std::vector<std::shared_ptr<Ints>> VecIntsPtr;
And the code to set the values is not all that difficult once you get use to using smart pointers
. If you have c++11
, c++14
or c++17
available to you, you can even use the newer features of std::vector
and other standard containers with the new std::move()
or std::forward<T>()
semantics by using std::vector
's emplace_back()
function instead of using push_back()
. It will invoke any objects constructor using perfect forwarding
, and this method works well with smart pointers
, but care needs to be taken as it can cause more problems if not done correctly compared to the well known push_back()
that is safer so to speak. You can read this Stack Q/A
: emplace_back() and perfect forwarding.
Now when I ran the code above with the smart pointers in both cases with the smart pointers being declared as local (stack), or with static storage... I did get about the same results with the internal size of the vectors being 250
x 250
in size at about 43753ms
& 44604ms
for both functions with static shared_ptr<>
& 43936ms
& 44008ms
for local-stack shared_ptr<>
.
A small optimization can be done within your two functions inside the triple for loops. You can change these two lines in your triple for loops respectively:
// Ans[i][j] = Ans[i][j] + A[i][k] * B[k][j]; // To
Ans[i][j] += A[i][k] * B[k][j];
When I did this: with VEC_SIZE = 250
the ExecutionTimer
reported about 34679ms
& 34979ms
with the non static shared_ptr<>
version. It shaved off about 10000ms
of time just by using the += operator
as opposed to expanding it out.
Now as for your other question:
At last, I found that the noLocality is more efficient than hasLocality when given a smaller array size (100, 100), was that abnormal or just insufficient amount of data for locality to take place?
When doing bench mark testing
it varies on your code
, your compiler & linker configuration(s)
, your debugger configuration(s)
, optimization(s)
, and even your machine's running OS with all of its back ground processes that are currently active in ram. I ran a few tests and I was getting vary similar results each time for the different sizes of the vectors. I did not notice a large discrepancy in ExecutionTime
between the two methods, but the one does seem to run maybe a few seconds slower than the other when there are more than 2002 elements.
As for the testing of your methods execution time I modeled them above as you had them except replacing them with a class of mine. Now personally what I would do for this kind of test is I would move the ExceutionTimer
out of these functions so that the functions do their own task, then where ever you are calling these functions wrap a timer object around it. Your functions definitions would look like this:
void hasLocality( const VecInts& A, const VecInts& B, VecInts& Ans, const unsigned size ) {
for ( unsigned i = 0; i < Ans.size(); i++ ) {
for ( unsigned k = 0; k < Ans[0].size(); k++ ) {
for ( unsigned j = 0; j < Ans[0].size(); j++ ) {
Ans[i][j] += A[i][k] * B[k][j];
}
}
}
}
void noLocality( const VecInts& A, const VecInts& B, VecInts& Ans, unsigned size ) {
for ( unsigned i = 0; i < Ans.size(); i++ ) {
for ( unsigned j = 0; j < Ans[0].size(); j++ ) {
for ( unsigned k = 0; k < Ans[0].size(); k++ ) {
// Ans[i][j] = Ans[i][j] + A[i][k] * B[k][j];
Ans[i][j] += A[i][k] * B[k][j];
}
}
}
}
And to bench mark test them would be like this:
ExecutionTimer<> timerA;
hasLocality( A, B, AnsHasLoc );
timerA.stop();
ExecutionTimer<> timerB;
noLocality( A, B, AnsNoLoc );
timerB.stop();
This way your internal function definitions or implementation doesn't have embedded unrelated dependent code in its body. The code is more maintainable, reusable, cleaner and has better readability.
Feel free to test the code above, play around with it to see what some of the differences are and welcome to modern c++
!