
The sample code below is a simplified version of my working code. In this code, the only write to a shared variable happens on the last line, where std::vector::push_back is called.

std::vector<struct FortyByteStruct> results;

#pragma omp parallel for num_threads(8)
for (int i = 0; i < 250; i++)
{
    struct FortyByteStruct result = some_heavy_work(i);

    #pragma omp critical
    {
        results.push_back(result);
    }
}

I was wondering if this push_back operation would result in false sharing, giving me a chance to optimize further by getting rid of it. I decided to do some benchmark tests first, before digging into this issue.

With chrono, I measured the wall-clock execution time of some_heavy_work() and of the critical section separately. The latter took about 10^(-4) times the execution time of the former, so I concluded that there would be almost no benefit from optimizing this part, whether false sharing is involved or not.
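
For reference, the measurement was done roughly along these lines (a simplified sketch, not the exact code; the work_ns/push_ns accumulators are illustrative names):

#include <chrono>

std::vector<struct FortyByteStruct> results;
long long work_ns = 0, push_ns = 0;  // updated only under the critical section

#pragma omp parallel for num_threads(8)
for (int i = 0; i < 250; i++)
{
    auto t0 = std::chrono::steady_clock::now();
    struct FortyByteStruct result = some_heavy_work(i);
    auto t1 = std::chrono::steady_clock::now();

    #pragma omp critical
    {
        results.push_back(result);
        auto t2 = std::chrono::steady_clock::now();
        work_ns += std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
        // t2 - t1 includes the wait for the lock plus the push_back itself
        push_ns += std::chrono::duration_cast<std::chrono::nanoseconds>(t2 - t1).count();
    }
}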

Anyway, I'm still curious whether false sharing is an issue here. Do I have to look at the internal implementation of std::vector? Any enlightenment would be greatly appreciated. (I'm on VS2015.)

nglee
  • Critical sections are more expensive than false sharing of a cache line, since you force the thread to enter a blocked state in the operating system if the section is contended. I would rather profile your piece of code first to see whether you have potential gains, or you will only waste time on a negligible optimization. – Jorge Bellon Nov 14 '17 at 09:24
  • @Jorge Bellón I'm not planning to optimize. I just wanted to know if there is some possibility of false sharing. But your point about the operating system putting the thread into a blocked state is something to think about. – nglee Nov 14 '17 at 11:27
  • Then you can just run your application with `perf stat -r $num_times ./executable` to get an idea of how many context switches your sample code is incurring. – Jorge Bellon Nov 14 '17 at 17:02

2 Answers


Given that your FortyByteStruct is probably smaller than a cache line (usually 64 bytes), there may be some false sharing when the results data is written. However, it is hardly consequential, because it is overshadowed by the cost of the critical section, and also by the "true" sharing that comes from modifying the vector itself (not its data). You don't need to know the details of std::vector's implementation; it is enough to know that its data is contiguous in memory and that its state (pointer(s) to data/size/capacity) lives in the memory of the vector variable itself.

False sharing is usually an issue when separate data on the same cache line is accessed by multiple threads in an unprotected fashion. Keep in mind that false sharing does not affect correctness, only performance.
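
To make that last point concrete, here is a minimal sketch of the classic false-sharing pattern (the names are illustrative): each thread writes only its own element, so the result is correct, but neighbouring elements share a cache line and performance suffers:

#include <omp.h>
#include <vector>

std::vector<long> counters(8, 0);  // adjacent longs: several fit in one 64-byte cache line

#pragma omp parallel num_threads(8)
{
    int tid = omp_get_thread_num();
    for (long k = 0; k < 100000000; k++)
        counters[tid]++;  // no data race, but the cache line ping-pongs between cores
}

Padding each counter to a full cache line (e.g. wrapping it in a struct with alignas(64)) makes the effect disappear.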

A slightly different example of false sharing would be when you have a std::vector<std::vector<struct FortyByteStruct>> and each thread performs an unprotected push_back on its own inner vector. I explained that in detail here.
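
A minimal sketch of that pattern (per_thread is an illustrative name):

// Each thread pushes only into its own inner vector, so no lock is
// needed for correctness. But the inner vectors' control blocks
// (pointer/size/capacity) sit next to each other in memory and can
// share a cache line, which every push_back then updates.
std::vector<std::vector<struct FortyByteStruct>> per_thread(8);

#pragma omp parallel num_threads(8)
{
    int tid = omp_get_thread_num();
    #pragma omp for
    for (int i = 0; i < 250; i++)
        per_thread[tid].push_back(some_heavy_work(i));
}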

In your example, with a known total size of the vector, the best approach would be to resize the vector before the loop and then just assign results[i] = result. This avoids the critical section, and OpenMP typically distributes the loop iterations in such a way that there is little false sharing. You also get a deterministic order of results.
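
A sketch of that approach:

std::vector<struct FortyByteStruct> results(250);  // size known up front

#pragma omp parallel for num_threads(8)
for (int i = 0; i < 250; i++)
{
    // Each iteration writes a distinct element,
    // so no critical section is needed.
    results[i] = some_heavy_work(i);
}

With the usual static schedule, each thread gets a contiguous chunk of roughly 31-32 iterations, so at 40 bytes per element only the elements at chunk boundaries can even land on a shared cache line.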

That said, once you have confirmed by measurement that the time is dominated by some_heavy_work, you are fine.

Zulan
  • How does OpenMP distribute the loop to minimize false sharing? Is it done at compile time, where the compiler inspects the code and assigns each thread the indices of the vector to modify? – nglee Dec 01 '17 at 05:17
  • @nglee The default scheduling is implementation defined. However, you may assume that implementations use a reasonably large chunk size for whatever scheduling strategy they choose. Interleaving (adjacent iterations going to different threads) would be a bad and therefore unlikely default. There is no need for complicated code analysis to achieve this. – Zulan Dec 01 '17 at 10:01

I am not an expert in the std::vector implementation, but I am pretty sure it does not check whether another thread is writing at the same time.

Nevertheless, here are two pieces of advice:

  • Yes, the critical section has a small overhead, but it is negligible compared to the gain of performing some_heavy_work in parallel (I guess...). So, when in doubt, I would keep it there.

  • You should check the difference between critical and atomic (openMP, atomic vs critical?); a short sketch follows below.
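
A minimal sketch of the distinction (the done counter is illustrative): atomic only applies to simple scalar updates, so it can protect a counter but not a call like push_back, which still needs critical:

int done = 0;

#pragma omp parallel for num_threads(8)
for (int i = 0; i < 250; i++)
{
    struct FortyByteStruct result = some_heavy_work(i);

    #pragma omp critical   // required: push_back is not a scalar update
    {
        results.push_back(result);
    }

    #pragma omp atomic     // sufficient for a single scalar increment
    done++;
}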

Hope it helps

Thomas