You can split the vector into chunks and have each thread fill its own chunk with std::fill:
#pragma omp parallel
{
    auto tid = omp_get_thread_num();
    auto chunksize = v.size() / omp_get_num_threads();
    // Each thread fills its own contiguous chunk; the last thread also takes the remainder.
    auto begin = v.begin() + chunksize * tid;
    auto end = (tid == omp_get_num_threads() - 1) ? v.end() : begin + chunksize;
    std::fill(begin, end, 0);
}
You can further improve it by rounding chunksize up to the nearest cache line / memory word size (128 bytes = 32 ints), assuming that v.data() is aligned similarly. That way, you avoid any false sharing issues.
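Here is a minimal sketch of that rounding, assuming 4-byte int (so 32 ints per 128-byte block). The granularity constant, the parallel_fill wrapper, and the clamping with std::min are my additions, not part of the snippet above:

#include <algorithm>
#include <cstddef>
#include <vector>
#include <omp.h>

void parallel_fill(std::vector<int>& v, int value)
{
    // 128 bytes / sizeof(int) == 32; assumed cache line / prefetch granularity.
    constexpr std::size_t granularity = 32;
    #pragma omp parallel
    {
        auto tid = static_cast<std::size_t>(omp_get_thread_num());
        auto nthreads = static_cast<std::size_t>(omp_get_num_threads());
        // Ceil-divide, then round the chunk size up to the granularity so that
        // no two threads write within the same 128-byte block.
        auto chunksize = (v.size() + nthreads - 1) / nthreads;
        chunksize = (chunksize + granularity - 1) / granularity * granularity;
        // Clamp so trailing threads get an empty range instead of overrunning.
        auto begin = v.begin() + std::min(chunksize * tid, v.size());
        auto end = v.begin() + std::min(chunksize * (tid + 1), v.size());
        std::fill(begin, end, value);
    }
}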
On a dual-socket 24-core Haswell system, I get a speedup of roughly 9x: from 3.6 s with 1 thread to 0.4 s with 24 threads. At 4.8 billion ints that is ~48 GB/s. The results vary a bit and this is not a scientific analysis, but it is not too far off the memory bandwidth of the system.
For general performance, you should be concerned about dividing your vector the same way not only for this operation, but also for further operations (be they reads or writes), if possible. That way, you increase the chance that the data is actually in cache when you need it, or at least on the same NUMA node.
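As an illustration, here is a sketch where a later read pass reuses the same split. The chunk_bounds() helper and parallel_sum() are hypothetical names; the helper just reproduces the chunking rule from the fill loop above:

#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>
#include <omp.h>

// Hypothetical helper: the same chunking rule as in the fill loop, so every
// parallel pass assigns the same elements to the same thread.
std::pair<std::size_t, std::size_t>
chunk_bounds(std::size_t n, int tid, int nthreads)
{
    auto chunksize = n / static_cast<std::size_t>(nthreads);
    auto begin = chunksize * static_cast<std::size_t>(tid);
    auto end = (tid == nthreads - 1) ? n : begin + chunksize;
    return {begin, end};
}

long long parallel_sum(const std::vector<int>& v)
{
    long long total = 0;
    #pragma omp parallel reduction(+ : total)
    {
        auto bounds = chunk_bounds(v.size(), omp_get_thread_num(),
                                   omp_get_num_threads());
        // Each thread reads exactly the range it filled (and first-touched)
        // earlier, so the data is likely in its cache or on its NUMA node.
        total += std::accumulate(v.begin() + bounds.first,
                                 v.begin() + bounds.second, 0LL);
    }
    return total;
}

If you use #pragma omp parallel for with schedule(static) instead of manual chunking, you typically get the same thread-to-iteration mapping between loops with identical trip counts, which achieves the same effect.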
Oddly enough, on my system std::fill(..., 1) is faster than std::fill(..., 0) for a single thread, but slower for 24 threads, both with gcc 6.1.0 and icc 17.0.1. I guess I'll post that as a separate question.