
I'm having trouble efficiently parallelizing the following loop:

#   pragma omp for nowait
    for (int i = 0; i < M; i++) { 
#       pragma omp atomic
        centroids[points[i].cluster].points_in_cluster++;
    }

This runs slower than the following, I guess due to the omp for overhead:

#   pragma omp single nowait
    for (int i = 0; i < M; i++) { 
        centroids[points[i].cluster].points_in_cluster++;
    }

Is there any way to make this go faster?

Me La Ria
    Having "work" that (almost) entirely consists of atomics which are inherently sequential in a parallel loop will never make sense performance wise. You can try using a reduction clause `reduction(+: centroids[0:M]` instead, but I wouldn't bet on this performing better than sequential either, as there just isn't enough computation/arithmetic in that loop. – paleonix Jan 30 '23 at 09:42
  • @paleonix I assume that `centroids` has a much smaller size than M. Anyway, I agree that a reduction is the right solution here. Since this loop cannot be vectorized, the threaded version possibly has some chance of being faster. – PierU Jan 30 '23 at 10:35
  • @PierU yeah, I saw that bad assumption when it was too late to change. I just wanted to show the syntax as many devs probably only know the scalar version. – paleonix Jan 30 '23 at 10:37
  • @paleonix and @pierU : I think you're too pessimistic. If M is large, parallelizing the loop will give a perfect speedup. The extra cost of combining `nthreads` small arrays for the centroids (and of creating them in the beginning) is probably going to be negligible. – Victor Eijkhout Jan 30 '23 at 14:20
  • @VictorEijkhout Considering that there seems to be a big parallel region around it and therefore `points` might be cached on the right core, it could work out, yes. I would not suggest it if I were sure that it was a bad idea. There is just not enough information to make any guarantees, so I try to keep expectations in check. – paleonix Jan 30 '23 at 14:36
  • @VictorEijkhout Well, I wrote that it had chances to be faster... – PierU Jan 30 '23 at 14:45
  • On most architectures, including x86-64, the processor executes atomic operations by locking the target cache lines, which are typically stored in a shared L3 cache with significant latency. If two cores access the same cache line, the operation is serialized. Regarding the `centroids` data structure (which is not provided), multiple `points_in_cluster` can share the same cache line and serialize accesses to different `centroids`. Based on past answers, the number of centroids is small (~10), so most accesses will be serialized like a mutex does. – Jérôme Richard Jan 30 '23 at 14:54
  • Note also that atomic operations are much slower than basic ones due to the atomicity and the latency of the L3 cache compared to the L1. This means sequential accesses will be slower, and in practice, parallel accesses will be even slower (due to other low-level architecture considerations). Parallel access can only be faster if `atomic_latency / number_of_core < L1_latency + copy_overhead`. The thing is that the L3 latency tends to grow (non-linearly) with the number of cores, since its latency is related to its size, and more cores means a bigger cache, a higher latency and slower atomics. Sad. – Jérôme Richard Jan 30 '23 at 15:07
  • @paleonix: histogramming is real work. It's not ALU work, though, it's mostly memory-level parallelism that you want more of. Atomic increments on separate cache lines can run in parallel on different cores. But without any parallelism between increments, vs. a single thread with non-atomic increments maybe having 12 cache misses in flight, or for a small array hot in L1d, pipelining the 6 cycle latency of memory-destination add to do about 1/clock in one core. (vs. one increment per about 18 cycles per core best case with `lock add`, or worse because it's also a barrier for loading next i) – Peter Cordes Jan 30 '23 at 16:24
  • So you'd need maybe 23 cores to break even on parallelizing with atomic increments on x86. (5 cycle load-use latency for `points[i].cluster`, which can only start after a previous `lock add` full barrier, so another approx 18 cycles on Ice Lake for that. https://uops.info/. Maybe schedule / software pipeline the load of one point ahead of an increment of another?) – Peter Cordes Jan 30 '23 at 16:30
  • @JérômeRichard: Anyway, for a small histogram table, the solution is simple: replicate it so each thread has a private array of counts, which you add at the end. (With SIMD.) Perhaps even give each thread multiple copies of the array to avoid store-forwarding stalls, unless you're on Ice Lake or Zen 2 (or some non-x86 ISA with this.) [Methods to vectorise histogram in SIMD?](https://stackoverflow.com/q/12985949) / [How to optimize histogram statistics with neon intrinsics?](https://stackoverflow.com/q/38501529) – Peter Cordes Jan 30 '23 at 16:33
  • @PeterCordes I should have worded it better, e.g. using "arithmetic complexity" instead of "work" and saying that it might not scale due to being bandwidth bound. – paleonix Jan 30 '23 at 16:43
  • @paleonix: Usually it won't be bandwidth bound, not on DRAM / memory controller bandwidth, especially if the count array isn't huge and fits in L3 or L2 cache. The single-threaded version might be bound on the amount of memory-level parallelism one core can sustain, so there's reason to hope multiple cores could give a speedup. (The problem being that sharing the count array with atomic increments destroys memory-level parallelism, but for small enough count arrays, you can replicate those, like compilers might do with your answer's `reduction(+:points_in_clusters[0:num_centroids])` ) – Peter Cordes Jan 30 '23 at 17:09
  • @JérômeRichard: Fun fact: Intel's [2022.dec future-extensions manual](https://cdrdv2-public.intel.com/671368/architecture-instruction-set-extensions-programming-reference.pdf) has a section (chapter 13) on RAO (remote atomics), new instructions for this use-case: a core sends off an atomic-add to L3/memory, and it gets done at the L3, weakly ordered like WC (but not evicting the data), without having to return any result back to the core like a store. Planned for Grand Ridge (many-core Atom, only Gracemont cores) in about 2024. https://www.phoronix.com/news/Intel-RAO-INT-Grand-Granite – Peter Cordes Jan 31 '23 at 22:15
  • With small histograms, each thread replicating the counts array is going to be even better, but for large histograms or continuously-updated ones, this could be a big win. – Peter Cordes Jan 31 '23 at 22:20
  • @PeterCordes Great! This seems promising. I was waiting for a long time for such basic operations to be faster. GPUs like Nvidia ones have such optimized atomic instructions that make them scale relatively well (and make code far simpler than without atomics, which is critical for optimized GPU code that is often already hard to read). One point is unclear to me: it seems the resulting updated value is not accessible from the core. If so, this is a bit sad, since it would be useful for many reduction algorithms (e.g. parallel appending), and a perfect implementation for C++ relaxed atomics. – Jérôme Richard Feb 01 '23 at 00:24
  • @JérômeRichard: Yeah, thought the same thing, but there are advantages to this design. It can look like a special kind of NT/WC store to the rest of the memory subsystem (including "write"-combining for multiple operations on the same cache line), where the natural thing is to *not* need a reply to the core that sent it. I wonder if sending back the old value to match each operation done at L3 or memory controller would require architecting the system for that in the first place, and this was a more natural thing to bolt on without a redesign? (Although for now just into Gracemont) – Peter Cordes Feb 01 '23 at 02:51

1 Answer


Theory

While atomics are certainly better than locks or critical regions due to their implementation in hardware on most platforms, they should in general still be avoided where possible, as they do not scale well: increasing the number of threads creates more atomic collisions and therefore more overhead. Further hardware- and implementation-specific bottlenecks caused by atomics are described by @PeterCordes in the comments below the question and below this answer.

The alternative to atomics is a parallel reduction algorithm. Assuming that there are many more points than centroids, one can use OpenMP's reduction clause to give every thread a private version of centroids. These private histograms are then consolidated in an implementation-defined fashion after they have been filled.

There is no guarantee that this technique is faster than using atomics in every possible case. Whether it is depends not only on the sizes of the two index spaces, but also on the data itself, as the data determines the number of collisions when using atomics. A proper parallel reduction algorithm is in general still expected to scale better to large numbers of threads.
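
To illustrate what such a reduction effectively does, here is a minimal hand-rolled sketch of the same idea. It assumes the code sits inside an existing parallel region like in the question and that the number of centroids is available as num_centroids (that name is mine, not from the question):

    // one private histogram per thread (declared inside the parallel region),
    // roughly what the OpenMP reduction does for you internally
    int *local_counts = calloc(num_centroids, sizeof *local_counts); // needs <stdlib.h>

#   pragma omp for
    for (int i = 0; i < M; i++) {
        local_counts[points[i].cluster]++;   // no atomics needed here
    }

    // consolidate the private histograms; only this short loop still needs atomics
    for (int c = 0; c < num_centroids; c++) {
#       pragma omp atomic
        centroids[c].points_in_cluster += local_counts[c];
    }

    free(local_counts);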

Practice

The problem with using a reduction in your code is the Array-of-Structs (AoS) data layout. Specifying

# pragma omp for reduction(+: centroids[0:num_centroids])

will produce an error at build time, as the compiler does not know how to reduce the user-defined type of centroids. Specifying

# pragma omp for reduction(+: centroids[0:num_centroids].points_in_cluster)

does not work either as it is not a valid OpenMP array section.

One can try to use a custom reduction here, but I do not know how to combine a user-defined reduction with OpenMP array sections (see the edit at the end). It could also be very inefficient to create all the unused members of the centroid struct on every thread.
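
For reference, an untested sketch of what such a user-defined reduction might look like follows; the struct layout and names are assumptions, and whether this combines correctly (and efficiently) with the array section is exactly what I am unsure about:

    typedef struct {
        /* ... other centroid members ... */
        int points_in_cluster;
    } centroid_t;

    // user-defined reduction that only combines the counter member
#   pragma omp declare reduction(centroid_add : centroid_t :              \
            omp_out.points_in_cluster += omp_in.points_in_cluster)        \
            initializer(omp_priv = (centroid_t){0})

#   pragma omp for reduction(centroid_add : centroids[0:num_centroids])
    for (int i = 0; i < M; i++) {
        centroids[points[i].cluster].points_in_cluster++;
    }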

With a Struct-of-Arrays (SoA) data layout you would just have a plain integer buffer, e.g. int *points_in_clusters, which could then be used in the following way (assuming that there are num_centroids elements in centroids and now also in points_in_clusters):

#   pragma omp for nowait reduction(+: points_in_clusters[0:num_centroids])
    for (int i = 0; i < M; i++) { 
        points_in_clusters[points[i].cluster]++;
    }
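
Note that reducing over an array section like this requires OpenMP 4.5 or newer, so it needs a reasonably recent compiler (e.g. GCC 6+ or a recent Clang, if I remember correctly); older compilers will reject the clause.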

If you cannot just change the data layout, you could still use some scratch space for the OpenMP reduction and afterwards copy the results back to the centroid structs in another loop. But this additional copy operation could eat into the savings from using reduction in the first place.
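
A sketch of that fallback could look as follows (again assuming num_centroids; the scratch buffer name is made up):

    // scratch buffer for the reduction, freed after copying the results back
    int *points_in_clusters = calloc(num_centroids, sizeof *points_in_clusters); // needs <stdlib.h>

#   pragma omp parallel for reduction(+: points_in_clusters[0:num_centroids])
    for (int i = 0; i < M; i++) {
        points_in_clusters[points[i].cluster]++;
    }

    // copy the reduced counts back into the AoS layout
    for (int c = 0; c < num_centroids; c++) {
        centroids[c].points_in_cluster = points_in_clusters[c];
    }

    free(points_in_clusters);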

Using SoA also has benefits for (auto-) vectorization (of other loops) and potentially improves cache locality for regular access patterns. AoS on the other hand can be better for cache locality when encountering random access patterns (e.g. most sorting algorithms if the comparison makes use of multiple variables from the struct).
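
For illustration, the two layouts could look like this (all members other than the counter are made up):

    // Array of Structs (AoS): one struct per centroid
    typedef struct {
        double x, y;              // hypothetical coordinate members
        int points_in_cluster;
    } centroid_aos_t;
    centroid_aos_t *centroids;    // access: centroids[c].points_in_cluster

    // Struct of Arrays (SoA): one plain array per member
    typedef struct {
        double *x, *y;            // hypothetical coordinate arrays
        int *points_in_clusters;
    } centroids_soa_t;
    centroids_soa_t cs;           // access: cs.points_in_clusters[c]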


PS: Be careful with nowait. Does the work that follows really not depend on the resulting points_in_cluster?

EDIT: I removed my alternative implementation using a user-defined reduction operator as it was not working. I seem to have fixed the problem, but I do not have enough confidence in this implementation (performance- and correctness-wise) to add it back into the answer. Feel free to improve upon the linked code and post another answer.

paleonix
  • It's not just contention that makes atomics slow. It's that (at least on x86), atomic RMW is only possible with instructions that are also full memory barriers, so there's no memory-level parallelism (within one thread) for atomic increments of independent locations. `lock add mem, imm8` has one per 18 cycle throughput on Ice Lake for example (https://uops.info), and that probably also stops the load of the next `points[i].cluster` from starting until after that store completes, so maybe one increment per 18 or 23 cycles, vs. 1 per cycle with plain operations in single-threaded code. – Peter Cordes Jan 30 '23 at 18:53
  • @PeterCordes Yes, the "scaling" argument is somewhat academic/theoretical, especially if the code is not targeted at HPC (although core numbers in consumer chips have grown as well...). It is still convenient as a motivation for avoiding atomics because it "works" independent of the exact implementation of atomics/platform. Thanks for the additional details! – paleonix Jan 30 '23 at 19:02