Problem context
I am currently analyzing and improving the performance of a computation-intensive, concurrent application written in C++ with the OpenMP parallel programming model. I have seen with profiling tools that a specific parallel region of the code drops the frequency to 200MHz, meaning that a very high amount of cycles are spent on system time or with CPUs idle. I have identified that the cause of this issue is a high number of memory allocations performed concurrently, causing the allocator to synchronize the threads and losing a lot of time waiting.
These memory allocations, though, are the result of a vector<double>
operator overloading that is heavily used during the parallel loop in question (from now on region of interest). The function of the operator overloading is as follows:
std::vector<double> operator+( const std::vector<double>& v1 , const std::vector<double>& v2 )
{
std::vector<double> v = v1;
for( unsigned int i=0; i < v1.size() ; i++ )
{ v[i] += v2[i]; }
return v;
}
This is just an example, as the same is done with the other arithmetic operators and also with operations with a vector and a scalar (double
type). As you can see, a new vector is initialized and then returned as a result of the operation. This causes at least one malloc and one free. But as I said, this is called several times during one iteration of the region of interest, and this loop runs for a high number of iterations on a high number of parallel threads (48, at most). One example of this operation call is the following:
std::vector<double> corner_point= 0.5*(my_voxel_center+other_voxel_center);
In this case, two operations are done one after the other (+
of vectors and then *
of vector and scalar) and then the result is assigned to a newly created vector.
Question
So, my question is the following: as we have seen how bad this performs, which should be the best practice on operators overloading, specifically on types like vector<>
, to avoid allocating and freeing a new vector every time this is called? Is there a better way to write that?
I have read the "https://stackoverflow.com/questions/4421706/what-are-the-basic-rules-and-idioms-for-operator-overloading" post searching for help, but there are no comments on the specific use of memory inside the overloading functions, and how these perform on concurrent applications.
I am aware that maybe there is no other way to that, in this case: where should I focus my attention to solve this issue? My thoughts are on:
- Using another allocator that handles concurrency better.
- Not using operator overloading at all to avoid allocating these temporal vectors, and perform the operation with a loop every time this appears at the code. In this case, code size growth should not be a problem as I said, this application is computation critical and this is the thing that matters the most.