
I'm currently learning about memory-bandwidth-bound and CPU-bound performance and roofline graphs, and I'd love some help and input on how to analyze the following figure.

Roofline figure from "https://www.mdpi.com/2079-3197/8/1/20"

The first thing I'm trying to determine is which of the two kernels, Dirac or LBM, is closer to the empirical upper-bound performance on ThunderX2. My thinking is that Dirac is closer, because the red triangle (representing TX2's measured performance) sits closer to the roofline for Dirac than it does for LBM. Can anyone correct my reasoning or approach if it's wrong?
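To make "closer to the roofline" concrete, here is a minimal sketch of how I understand the comparison, assuming the standard roofline model in which attainable performance is the smaller of peak compute and arithmetic intensity times memory bandwidth. The function names and every number below are hypothetical placeholders, not values read from the figure. Since roofline plots normally use log-log axes, the visual gap between a point and the roof corresponds to a ratio, so I compare fractions of the bound rather than absolute differences.

```python
def attainable_gflops(ai, peak_gflops, bw_gbs):
    """Roofline bound: min(peak compute, arithmetic intensity * bandwidth)."""
    return min(peak_gflops, ai * bw_gbs)


def roofline_fraction(measured_gflops, ai, peak_gflops, bw_gbs):
    """Fraction of the roofline bound that a measured kernel achieves."""
    return measured_gflops / attainable_gflops(ai, peak_gflops, bw_gbs)


# Hypothetical machine parameters (placeholders, NOT taken from the paper).
machine = {"peak_gflops": 1000.0, "bw_gbs": 200.0}

# Hypothetical kernels: arithmetic intensity in FLOP/byte, measured GFLOP/s.
kernels = {
    "kernel_a": {"ai": 0.5, "measured": 80.0},
    "kernel_b": {"ai": 2.0, "measured": 240.0},
}

fractions = {
    name: roofline_fraction(k["measured"], k["ai"], **machine)
    for name, k in kernels.items()
}

for name, frac in fractions.items():
    print(f"{name}: {frac:.0%} of its roofline bound")
```

With these made-up inputs, `kernel_b` is faster in absolute terms but `kernel_a` achieves the larger fraction of its bound, which is the sense of "closer" I mean above.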

The second thing I'm trying to conclude is which of the three architectures (Skylake, ThunderX2, or Haswell) is "best suited" for LBM. There may be more than one way to look at this. My guess is that SKL is best suited, since it achieves the highest absolute LBM performance of the three. But it could also be TX2, since its distance from its own roofline is the shortest of the three, which would make it the most efficient one for LBM.
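The two candidate answers come from two different metrics, which a small sketch can separate. It assumes the same roofline model, a single kernel with a fixed arithmetic intensity, and three hypothetical machines; the names `machine_x`, `machine_y`, `machine_z` and all numbers are invented for illustration and do not correspond to SKL, TX2, or HSW.

```python
def attainable_gflops(ai, peak_gflops, bw_gbs):
    """Roofline bound: min(peak compute, arithmetic intensity * bandwidth)."""
    return min(peak_gflops, ai * bw_gbs)


# Hypothetical arithmetic intensity of one kernel, in FLOP/byte.
AI = 0.5

# Hypothetical machines (placeholders): peak GFLOP/s, bandwidth GB/s,
# and the measured GFLOP/s of the kernel on that machine.
machines = {
    "machine_x": {"peak_gflops": 2000.0, "bw_gbs": 240.0, "measured": 90.0},
    "machine_y": {"peak_gflops": 1000.0, "bw_gbs": 160.0, "measured": 72.0},
    "machine_z": {"peak_gflops": 500.0, "bw_gbs": 100.0, "measured": 35.0},
}

# Metric 1: absolute measured performance.
fastest = max(machines, key=lambda m: machines[m]["measured"])

# Metric 2: fraction of each machine's own roofline bound.
efficiency = {
    name: m["measured"] / attainable_gflops(AI, m["peak_gflops"], m["bw_gbs"])
    for name, m in machines.items()
}
most_efficient = max(efficiency, key=efficiency.get)

print("highest absolute performance:", fastest)
print("closest to its own roofline:", most_efficient)
```

In this invented example the two metrics pick different machines, which is exactly the ambiguity in my question: "best suited" needs a definition before the figure can answer it.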

Any input, correction, or suggestion would be greatly appreciated!

Peter Cordes
Forrest
  • "Best suited" depends on cost; if you're buying CPU time by the minute, then cost in dollars. Other costs could be cost in energy per computation, then closer to roofline is closer to max performance per watt, although inherent differences in microarchitecture can make one more efficient. Or if you have a server room filled with a mix of machines and have some work to divide up, then yeah you're probably best off allocating the bandwidth-intensive (lower computational intensity) work to the machine with more bandwidth per GFLOP, saving your AVX-512 Skylake for work that can benefit. – Peter Cordes Dec 21 '21 at 08:09
  • I assume your Skylake is actually SKX (Skylake-server with AVX-512), not SKL (Skylake-client). Your GFLOPs roofline graph confused me; it would take a huge clock speed difference for SKL to be that much faster than HSW. Or actually that must be aggregating across multiple cores. – Peter Cordes Dec 21 '21 at 08:12

0 Answers