Fastest way to reset every value of std::vector<int> to 0

I want to fill a std::vector<int> with zeros using OpenMP. How can I do that quickly?

I heard that looping over the vector to set each element to zero was slow, and that std::fill was much faster. Is that still true?

Do I have to manually divide the std::vector<int> into regions, use #pragma omp for to loop over the regions (one per thread), and then use std::fill inside the loop?

hamster on wheels
  • Yes, it's still true. And yes, if you're using OpenMP you have to assign the jobs yourself. Either way you go about it, if you care about it being fast you should measure. I would suggest assigning power-of-2-sized jobs (the thread referenced said it used 16 ints in an MMX register; this is probably the smallest feasible job size). Depending on the length of the vector, it might be faster to just fill single-threaded. Be sure to measure and find the crossover point. This is a comment, because it doesn't really answer your question well. It's just musings that you've probably had yourself. – OmnipotentEntity Feb 04 '17 at 20:09
  • GCC 6.3 and Clang 3.9.0 both compile both a "loop and assign 0 everywhere" and `std::fill` into a (tail) call to `memset`. It's not exactly the same code, but the heavy lifting is the same. – harold Feb 04 '17 at 20:14
  • I *bet* that filling the vector with zero is occupying a minimal part of your time. Don't worry about it until you have evidence that this is the problem area. – Martin Bonner supports Monica Feb 04 '17 at 20:22
  • @MartinBonner Quite true. I am working on other parts of the program now; I just want to know if there is a smarter way to do these steps. – hamster on wheels Feb 04 '17 at 20:25
  • @rxu - in that case, just write it in as simple a way as possible. Keep the complexity for where it gives you useful performance gains. (It's obviously worth asking whether there is a simple way to do it fast; the answer is, sadly, no.) – Martin Bonner supports Monica Feb 04 '17 at 20:31
  • @omni "Yes, it's still true": what horrible compiler are you using? Even the linked answer shows that memset, fill, and manually setting the values all perform identically. – Voo Feb 04 '17 at 21:34
  • @Voo The "I jumped the gun and didn't test" compiler. Which is why you should always test. Now if you'll excuse me, I have to go eat my foot some more. – OmnipotentEntity Feb 04 '17 at 21:36
  • @omni amen to that. – Voo Feb 04 '17 at 21:37
  • Did not profile, but I suspect that on Linux a `free` followed by `calloc` is likely to be much faster than anything else. This is because of overcommit: the actual zeroing is done when you use a specific page for the first time. See: http://stackoverflow.com/questions/2688466/why-mallocmemset-is-slower-than-calloc – sbabbi Feb 04 '17 at 21:40
  • @sbabbi That is just shifting the actual work to later. It could be good due to locality, but it could also be bad because of all the page faults. – Zulan Feb 04 '17 at 22:04
  • The automatic memset substitution hands control of the choice of non-temporal stores over to the runtime library. Multiple threads may pay off only if multiple memory controllers are active and the NUMA placement matches the rest of the application. – tim18 Feb 04 '17 at 22:34
  • Related http://stackoverflow.com/questions/11576670/in-an-openmp-parallel-code-would-there-be-any-benefit-for-memset-to-be-run-in-p – Z boson Feb 06 '17 at 12:24

1 Answer


You can split the vector into chunks, one per thread, and fill each with std::fill:

#pragma omp parallel
{
    auto tid = omp_get_thread_num();
    auto chunksize = v.size() / omp_get_num_threads();
    auto begin = v.begin() + chunksize * tid;
    // The last thread also covers the remainder of the division.
    auto end = (tid == omp_get_num_threads() - 1) ? v.end() : begin + chunksize;
    std::fill(begin, end, 0);
}
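
For completeness, a minimal self-contained program around that snippet could look as follows (the vector size and initial value are illustrative choices of mine, not from the answer); build with OpenMP enabled, e.g. g++ -O2 -fopenmp:

#include <algorithm>
#include <cstddef>
#include <vector>
#include <omp.h>

int main() {
    // Illustrative size: 64 Mi ints (~256 MB), initialized to a nonzero value.
    std::vector<int> v(std::size_t{1} << 26, 1);

    #pragma omp parallel
    {
        auto tid = omp_get_thread_num();
        auto chunksize = v.size() / omp_get_num_threads();
        auto begin = v.begin() + chunksize * tid;
        // The last thread also covers the remainder of the division.
        auto end = (tid == omp_get_num_threads() - 1) ? v.end()
                                                      : begin + chunksize;
        std::fill(begin, end, 0);
    }
}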

You can further improve this by rounding chunksize to a multiple of the cache-line / memory-word size (here 128 bytes = 32 ints), assuming that v.data() is aligned accordingly. That way you avoid any false-sharing issues.
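
As a sketch of that rounding (my illustration, assuming 128-byte granularity as above, that v.data() starts on such a boundary, and that v and the OpenMP headers are in scope as in the program above):

constexpr std::size_t line_ints = 128 / sizeof(int); // 32 ints per 128-byte line

#pragma omp parallel
{
    std::size_t nthreads = omp_get_num_threads();
    std::size_t tid = omp_get_thread_num();
    // Ceil-divide the work, then round the chunk up to whole lines so that
    // no two threads ever write within the same line (no false sharing).
    std::size_t chunksize = (v.size() + nthreads - 1) / nthreads;
    chunksize = (chunksize + line_ints - 1) / line_ints * line_ints;
    // Clamp so the last threads do not run past the end of the vector.
    std::size_t first = std::min(chunksize * tid, v.size());
    std::size_t last = std::min(first + chunksize, v.size());
    std::fill(v.begin() + first, v.begin() + last, 0);
}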

On a dual-socket 24-core Haswell system, I get a speedup of roughly 9x: from 3.6 s with 1 thread to 0.4 s with 24 threads for 4.8 billion ints, i.e. ~48 GB/s. The results vary a bit and this is not a scientific analysis, but it is not too far off the memory bandwidth of the system.

For general performance, you should be concerned about dividing your vector the same way not only for this operation, but also for later operations (reads or writes), if possible. That way you increase the chance that the data is actually in cache when you need it, or at least on the same NUMA node.
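
For instance (my illustration, not part of the original answer), using schedule(static) with the same iteration count for every pass keeps the iteration-to-thread mapping consistent in practice, so each thread revisits the pages it first touched:

// Zero the vector; schedule(static) splits the iterations into equal
// contiguous blocks, one per thread.
#pragma omp parallel for schedule(static)
for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(v.size()); ++i)
    v[i] = 0;

// A later pass with the same bounds and schedule gets the same blocks, so
// each thread reuses data that is warm in its cache / local NUMA node.
#pragma omp parallel for schedule(static)
for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(v.size()); ++i)
    v[i] += 1;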

Oddly enough, on my system std::fill(..., 1) is faster than std::fill(..., 0) for a single thread, but slower for 24 threads, both with gcc 6.1.0 and icc 17.0.1. I guess I'll post that as a separate question.

Zulan
  • At a quick glance, that doesn't seem to be the best division of work among the threads when the size of the vector is not divisible by the number of threads: the last thread might get less work to do than the other threads. Thanks for the nice example of `omp parallel`, though. I didn't know we could use tid that way and still have omp loops over all the threads in my code ... – hamster on wheels Feb 04 '17 at 21:37
  • No, the last thread gets more work, but at most `nthreads - 1` elements more, which is negligible. Alternatively you can do `chunksize = (v.size() - 1) / nthreads + 1`, which is a bit more balanced. But I would argue (going to edit that in) that it is actually more important to align the chunks. – Zulan Feb 04 '17 at 21:40
  • What does aligning the chunks mean? – hamster on wheels Feb 04 '17 at 21:44
  • @rxu, did you see the edit to my answer? Also take a look at: https://en.wikipedia.org/wiki/Data_structure_alignment#Allocating_memory_aligned_to_cache_lines does that clarify? – Zulan Feb 04 '17 at 22:05
  • Yup, I get the idea now. – hamster on wheels Feb 04 '17 at 22:07
  • I think it might be a bad idea to define chunksize this way due to rounding errors. I would do `begin = ithread*N/nthreads, end = (ithread+1)*N/nthreads` where `N = v.size()`. This way the product is done before the division. – Z boson Feb 06 '17 at 10:08
  • @Zboson My argument is that the worst-case rounding error of `nthreads - 1` is insignificant, but the ability to align chunksize to a cache line is important. – Zulan Feb 06 '17 at 11:50