
I have written a basic three-loop matrix multiplication with serial and OpenMP implementations. For the same size (3200x3200), `perf stat -a -e instructions,cycles` shows:

Serial

   265,755,992,060        instructions              #    0.71  insn per cycle
   375,319,584,656        cycles

      85.923380841 seconds time elapsed

Parallel (16 threads)

   264,524,937,733        instructions              #    0.30  insn per cycle
   883,342,095,910        cycles

      13.381343295 seconds time elapsed

In the parallel run, I expected the number of cycles to be roughly the same as in the serial run, but it isn't.

Any thoughts on the difference?
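
For reference, a minimal sketch of the kind of kernel in question, reconstructed from the snippet quoted in the comments below (naive i,j,k loop order with `#pragma omp parallel for` on the outer loop); the actual code may differ in details such as element type and allocation:

    #include <omp.h>

    #define N 3200

    static double first[N][N], second[N][N], result[N][N];

    void matmul_naive(void)
    {
        /* Naive i,j,k order: the inner loop walks down a column of `second`,
           which is the cache-unfriendly access pattern discussed below. */
        #pragma omp parallel for            /* drop this line for the serial run */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                for (int k = 0; k < N; k++)
                    result[i][j] += first[i][k] * second[k][j];
    }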

UPDATE:

I reran the experiments with 8 and 16 threads, since the processor supports a maximum of 16 threads.

Using 8 threads
Max nthread = 16
Total execution Time in seconds: 13.4407235400
MM execution Time in seconds: 13.3349801241

 Performance counter stats for 'system wide':

            906.51 Joules power/energy-pkg/
   264,995,810,457        instructions              #    0.59  insn per cycle
   449,772,039,792        cycles

      13.469242993 seconds time elapsed

and

Using 16 threads
Max nthread = 16
Total execution Time in seconds: 13.2618084711
MM execution Time in seconds: 13.1565077840

 Performance counter stats for 'system wide':

          1,000.39 Joules power/energy-pkg/
   264,309,881,365        instructions              #    0.30  insn per cycle
   882,881,366,456        cycles

      13.289234564 seconds time elapsed

As you see, the wall-clock times are roughly the same, but the cycle count for 16 threads is about twice that for 8 threads. That means that with a higher cycle count and lower IPC, it is possible to keep the wall-clock time the same while using more threads. According to `perf list`, the event is `cpu-cycles OR cycles [Hardware event]`. I would like to know: is that the average cycles for one core, or the aggregate over N cores? Any comment on that?

mahmood
  • `cycles` is aggregate cycles across all threads. Normally you want to also count an event like `task-clock` that tells you how many CPU-seconds of total CPU time was used by all your threads. (Which also lets perf calculate an average frequency = cycles / task-clock for the logical cores your threads ran on; a worked example follows these comments.) Anyway, looks like your code barely benefits at all from using both logical cores on the same physical core, probably maxing out memory-level parallelism with a single thread due to the terrible access pattern of your naive matmul. – Peter Cordes Mar 05 '22 at 19:52
  • What compiler / options did you compile with? Did it include `-ffast-math`? – Peter Cordes Mar 05 '22 at 19:53
  • I used `-O3 -fsplit-stack -fopenmp`. – mahmood Mar 06 '22 at 08:33
  • How do you justify the same wall clock and instruction counts vs. different cycles? – mahmood Mar 06 '22 at 08:34
  • If you divide the same total amount of work across more threads, instruction count stays very close to the same. Running 2 threads per phys core instead of 1 doesn't gain any throughput, apparently one thread per core maxes out the cache-miss throughput. So you spend the same wall-clock time, but with twice as many logical core clocks ticking. So we're seeing a case of zero SMT scaling, but pretty good scaling with physical cores. – Peter Cordes Mar 06 '22 at 08:44
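
To illustrate the `task-clock` suggestion from the first comment above (a hedged sketch: `./mm` stands in for the actual binary, and the numbers are rough estimates based on the 16-thread run):

    # per-process counters, including CPU time actually consumed by all threads
    perf stat -e task-clock,cycles,instructions ./mm

    # rough sanity check against the 16-thread numbers above, assuming all 16
    # logical CPUs were busy for the ~13.2 s MM phase:
    #   task-clock  ~ 16 * 13.2 s          ~ 211 CPU-seconds
    #   avg. freq.  ~ 883e9 cycles / 211 s ~ 4.2 GHz per busy logical CPU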

3 Answers


Assuming your process scales perfectly, the number of instructions will be shared amongst all cores and the total number of instructions will be the same. The same thing applies to the number of cycles. However, when your process does not scale, the number of instructions should stay the same but the number of cycles increases. This is generally due to contention on a shared resource that causes stall cycles in the pipeline.

In your parallel implementation, the memory hierarchy is not used efficiently: the parallel implementation causes a lot of cache misses that can saturate the L3 or the RAM when many threads are used, hence memory stalls. If you use simultaneous multithreading (aka Hyper-Threading), this can also cause the problem, because two threads on the same core often do not run truly in parallel (some parts of the core are shared between the threads).
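
One way to separate the SMT effect from plain bandwidth saturation (a hedged sketch, not the asker's setup; `./mm` stands in for the actual binary) is to pin one thread per physical core and compare the cycle count and IPC with the 16-thread run:

    # one thread per physical core, spread across cores (OpenMP 4.0+ environment variables)
    OMP_NUM_THREADS=8 OMP_PLACES=cores OMP_PROC_BIND=spread ./mm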

Jérôme Richard

A constant instruction count makes sense; there's the same amount of total work to do, whether those instructions all run on the same core or not.

Are you using SMT such as hyperthreading? IPC goes down when the same physical core divides its time between two logical cores. For some programs, SMT increases overall throughput; for cache-bound programs (like a naive matmul), having 2 threads compete for the same L1/L2 cache hurts.

Otherwise, if this is 16 physical cores, you might be seeing that effect from L3 contention, since a single thread no longer has all the L3 to itself.

Anyway, with proper cache-blocking, SMT usually hurts overall throughput for matmul / dense linear algebra. A physical core can be saturated with ALU work by one thread running well-tuned code, so contention for per-core caches just hurts. (But multi-threading definitely helps overall time, as it did in your case.) A cache-blocked matmul is usually 5 nested loops, as in the example near the end of What Every Programmer Should Know About Memory.
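
For concreteness, a sketch of what such a cache-blocked kernel can look like, here with all three loops tiled (so six loops rather than five) and a hypothetical tile size `BS` that divides N; this is meant to show the structure, not to be a tuned implementation:

    #define N  3200
    #define BS 64                  /* tile size to tune so the working set fits in L1/L2 */

    static double A[N][N], B[N][N], C[N][N];   /* C assumed zero-initialised */

    void matmul_blocked(void)
    {
        #pragma omp parallel for   /* threads own disjoint row-blocks of C: no races */
        for (int ii = 0; ii < N; ii += BS)
            for (int kk = 0; kk < N; kk += BS)
                for (int jj = 0; jj < N; jj += BS)
                    /* multiply the A(ii,kk) tile by the B(kk,jj) tile into C(ii,jj) */
                    for (int i = ii; i < ii + BS; i++)
                        for (int k = kk; k < kk + BS; k++) {
                            double a = A[i][k];
                            for (int j = jj; j < jj + BS; j++)
                                C[i][j] += a * B[k][j];
                        }
    }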

Agner Fog (https://agner.org/optimize/) also mentions hyperthreading hurting for some workloads in his microarch PDF.

Peter Cordes
  • The processor is Ryzen 7 3700X 8-Core where 16 threads are launched when setting the number of threads to `omp_get_num_procs()`. – mahmood Mar 05 '22 at 14:58
  • 2
    @mahmood: Ok yes, SMT with 2 logical cores per physical. You might see better scaling with limiting OpenMP to 8 threads. – Peter Cordes Mar 05 '22 at 14:58
  • I have updated the post. May I know your comment about that. Do you agree with the results? – mahmood Mar 05 '22 at 22:03
  • @mahmood: I already commented under the question a couple hours ago, soon after the first time you posted that comment. – Peter Cordes Mar 05 '22 at 22:11
  • A constant instruction count only makes sense if you assume that threads waiting due to imbalance execute no instructions. That is not, generally, true. Many OpenMP run-times poll for a while before sleeping in the kernel, because that can reduce the overall elapsed time. See OMP_WAIT_POLICY https://www.openmp.org/spec-html/5.0/openmpse55.html and the KMP_BLOCKTIME environment variable in the LLVM runtime https://www.intel.com/content/www/us/en/develop/documentation/cpp-compiler-developer-guide-and-reference/top/compilation/supported-environment-variables.html – Jim Cownie Mar 07 '22 at 09:53
  • @JimCownie: There's no locking; all threads can be working on different parts of the output array (matrix). So yes, the cost of contention here is just stalling the CPU core on demand loads, not on acquiring a lock. Fair point that if you were doing a lot of really fine-grained reductions or something, you might end up with some spinning. As you say, that's the general case, not this case. – Peter Cordes Mar 07 '22 at 12:51
  • @PeterCordes. You don't need locks to create load imbalance. A simple strategic schedule in an OpenMP for loop can easily do it... – Jim Cownie Mar 07 '22 at 18:24
  • @JimCownie: Oh right, scheduling chunks of work, I was forgetting that OMP wouldn't necessarily use static scheduling. But if it did, that would be fine; I expect each thread to run about the same speed. – Peter Cordes Mar 07 '22 at 18:30
  • @PeterCordes even with static scheduling there can be load imbalance which leads to threads waiting at a barrier (either the implicit one at the end of a parallel for, or at the join at the end of the parallel region. But, in a serial region, all but one thread is waiting too, and they may be polling then... – Jim Cownie Jun 13 '22 at 08:38

Matrix-matrix multiplication is the typical example of an operation that theoretically has lots of cache reuse and so can run close to peak speed, except that the standard three-loop implementation doesn't. Optimized implementations that use the cache efficiently are actually six loops deep: each of your three loops gets tiled, and you need to interchange loops and set the tile sizes just right. This is not trivial.

So your implementation basically gets no benefit from the cache. Maybe you get some effect from the L3 cache, but, given a sufficient problem size, certainly not from L1, and likely not from L2. In other words: you are bandwidth-bound. And then the problem is that you may have 16 cores, but you don't have enough bandwidth to feed all of them.

I must say that your factor of 6-7 is a little disappointing, but maybe your architecture just doesn't have enough bandwidth. I know that on top-of-the-line nodes I would expect something like 12. But why don't you test it? Write a benchmark that does nothing but stream data to and from memory: vector-vector addition. Then see for how many cores you get linear speedup.
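
A minimal sketch of such a streaming benchmark (in the spirit of STREAM; names and sizes here are made up, not from the question). Run it with OMP_NUM_THREADS set to 1, 2, 4, 8, 16 and watch where the GB/s figure stops scaling:

    #include <stdio.h>
    #include <stdlib.h>
    #include <omp.h>

    #define NELEM (1L << 25)   /* ~33M doubles per array, ~268 MB each: far bigger than L3 */

    int main(void)
    {
        double *a = malloc(NELEM * sizeof *a);
        double *b = malloc(NELEM * sizeof *b);
        double *c = malloc(NELEM * sizeof *c);
        if (!a || !b || !c) return 1;

        #pragma omp parallel for          /* first-touch initialisation */
        for (long i = 0; i < NELEM; i++) { a[i] = 1.0; b[i] = 2.0; c[i] = 0.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for          /* streams two arrays in, one out */
        for (long i = 0; i < NELEM; i++)
            c[i] = a[i] + b[i];
        double t1 = omp_get_wtime();

        /* three arrays of NELEM doubles cross the memory bus (ignoring write-allocate) */
        double gb = 3.0 * NELEM * sizeof(double) / 1e9;
        printf("%d threads: %.2f GB/s (c[0]=%g)\n",
               omp_get_max_threads(), gb / (t1 - t0), c[0]);

        free(a); free(b); free(c);
        return 0;
    }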

To address some points you raise in your reply:

  1. Speedup is nice, but you should also take a look at absolute performance. Matmul can run at 90+ percent of peak. Use MKL or any optimized BLAS, and compare that to what you get (a call sketch follows this list).

  2. SIMD makes no difference in speedup, only in absolute performance.

  3. In that i,j,k update you're not stating which index is in the inner loop. Regardless, let your compiler generate an optimization report. You'll find that compilers are that clever, and may very well have interchanged loops to get vectorization.

  4. FP latency is not a concern in kernels as simple as yours. Between prefetching and out-of-order execution and whatnot you really don't have to worry much about latency.
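
As a concrete version of point 1, a sketch of the BLAS comparison (assuming double precision and row-major n x n matrices; link against OpenBLAS, BLIS, MKL or another optimized BLAS providing the CBLAS interface):

    #include <cblas.h>

    /* Reference point: let an optimized BLAS compute C = A*B for n x n
       row-major matrices, then compare its GFLOP/s (2*n^3 flops) with the
       hand-written kernel timed the same way. */
    void matmul_blas(int n, const double *A, const double *B, double *C)
    {
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    }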

Really: your performance is bandwidth-limited. But you're only measuring speedup so you don't really see that, apart from the fact that using all your cores will saturate your bandwidth and cap your speedup.

Victor Eijkhout
  • Turns out it's not a 16-core machine, it's an 8-core/16-thread Ryzen 7 3700X. So a 6 to 7x speedup is close to linear with physical cores. (Surprisingly good for a naive matmul; although maybe not that surprising since `result[i][j] += first[i][k] * second[k][j];` won't vectorize with SIMD unless compilers are way smarter than I expect, and bottlenecks on FP latency, so each core isn't using that much bandwidth. But if it was just the FP latency then SMT would help.) – Peter Cordes Mar 05 '22 at 15:02
  • You say "your reply" like I was the asker of the original question. I'm actually the author of one of the other answers. Your point 4 is nonsense; out-of-order execution can't help with one long dependency chain that goes through `sum`, and unless you compile with `-ffast-math` the compiler can't unroll for you with multiple accumulators. The only thing that might make it *not* a problem is if a memory bottleneck (from striding down one column with that terrible access pattern) limits throughput to less than 1 element per 3 clock cycles (the latency of addss/sd on Zen1/2/3). – Peter Cordes Mar 05 '22 at 19:13
  • See my answer on [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) re: the effect of FP latency on throughput. Or [loop unrolling not giving expected speedup for floating-point dot product](https://stackoverflow.com/q/71274300) for another example. The fact that the loop is a simple reduction with only a trivial amount of independent work per element is what makes FP latency a problem. (The inner loop of a naive matmul like this is a dot product with strided loads) – Peter Cordes Mar 05 '22 at 19:16
  • Agreed with your conclusion, though: it probably is bandwidth-limited because of the terrible access pattern. The snippet of code I mentioned in my first comment is copied from the pastebin links in the question; a good question would have included them directly, but I was curious enough to look and confirm a fully naive matmul (`k` is the inner-most loop). The info on what machine it is (8 core) is from the OPs comment under my answer. – Peter Cordes Mar 05 '22 at 19:18
  • @PeterCordes Didn't see that you were not the asker. Sorry. OOO: yeah, probably not relevant here, but prefetching probably is, and so *memory* latency is not a worry. I'll read your link about FP latency. SIMD & unroll is an interesting point. Considering that OpenMP is used, any hope of exact arithmetic is already out the window, so maybe the compiler uses that as an excuse to do unrolling. I'd have to check into that. – Victor Eijkhout Mar 05 '22 at 19:56
  • If prefetching was able to have the memory ready for each load, then FP ALU latency *would* be the bottleneck, limiting it to one sum every 3 clocks. The problem is that the access pattern is terrible so hardware prefetch can't keep up with that. – Peter Cordes Mar 05 '22 at 19:57
  • Yes, good point that OpenMP may give the compiler the freedom to treat FP math as associative within that loop. But they're only using `#pragma omp parallel for` on the *outer* loop, *not* `omp simd reduction(+:sum)` on the inner loop (that combination is sketched below). I don't know OpenMP that well (I usually manually vectorize with intrinsics) but I think without `reduction` it isn't giving the compiler the freedom to reassociate the `sum +=` in the inner loop. It can parallelize by having different threads work on different outer-loop iters (`result[i][j]` sums), if it can prove non-overlap between result and the sources. – Peter Cordes Mar 05 '22 at 20:02
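
For reference, a sketch of the combination described in that last comment, assuming a `sum`-accumulator form of the naive kernel (array and size names as in the sketch under the question). Even with the `reduction` clause, the strided `second[k][j]` loads remain the real bottleneck discussed above:

    void matmul_simd_reduction(void)
    {
        #pragma omp parallel for                  /* threads split the outer i loop  */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                double sum = 0.0;
                #pragma omp simd reduction(+:sum) /* lets the compiler reassociate + */
                for (int k = 0; k < N; k++)
                    sum += first[i][k] * second[k][j];
                result[i][j] = sum;
            }
    }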