
Problem

I'm trying to calculate CPU / GPU FLOPS performance but I'm not sure if I'm doing it correctly.

Let's say we have:

  • A Kaby Lake CPU (clock: 2.8 GHz, cores: 4, threads: 8)
  • A Pascal GPU (clock: 1.3 GHz, cores: 768).

This Wikipedia page says that Kaby Lake CPUs compute 32 single-precision (FP32) FLOPs per cycle per core and Pascal cards compute 2 FP32 FLOPs per cycle per CUDA core, which means we can compute their total FLOPS performance using the following formulas:

CPU:

TOTAL_FLOPS = 2.8 GHz * 4 cores * 32 FLOPs/cycle = 358.4 GFLOPS

GPU:

TOTAL_FLOPS = 1.3 GHz * 768 cores * 2 FLOPs/cycle = 1996.8 GFLOPS
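
For reference, the arithmetic above can be written as a small helper. This is only a sketch of the formula; the FLOPs-per-cycle factors are assumptions taken from the Wikipedia table, not something the hardware reports:

```python
def peak_gflops(clock_ghz, cores, flops_per_cycle):
    """Theoretical peak GFLOPS = clock (GHz) * cores * FLOPs issued per cycle per core."""
    return clock_ghz * cores * flops_per_cycle

# Kaby Lake: 2 x 256-bit FMA units -> 32 FP32 FLOPs per cycle per core (assumed from the table)
print(peak_gflops(2.8, 4, 32))    # 358.4
# Pascal: 1 FMA per CUDA core per cycle -> 2 FP32 FLOPs per cycle per core (assumed from the table)
print(peak_gflops(1.3, 768, 2))   # 1996.8
```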

Questions

  1. [SOLVED] Most of the guides I've seen (like this one) use physical cores in the formula. What I don't understand is why we don't use threads (logical cores) instead. Weren't threads created specifically to double floating-point performance? Why are we ignoring them?

  2. Am I doing this correctly at all? I couldn't find a single reliable source for calculating FLOPS; the information on the internet is contradictory. For the i7-7700HQ Kaby Lake CPU I found FLOPS values as low as 29 GFLOPS, even though the formula above gives 358.4 GFLOPS. I don't know what to believe.

  3. [EDITED] Is there a cross-platform (Win, Mac, Linux) library in Node.js / Python / C++ that exposes the GPU stats (shading cores, clock, FP32 and FP64 FLOPs per cycle) so I could calculate the peak performance myself, or one that automatically calculates the max theoretical FP32 and FP64 FLOPS by using all available CPU / GPU instruction sets (AVX, SSE, etc.)? It's quite ridiculous that we cannot get the FLOPS stats from the CPU / GPU directly; instead we have to download and parse a wiki page for the value. Even in C++, it seems (I don't actually know) that we have to download the 2 GB CUDA toolkit just to access the Nvidia GPU information, which would make it practically impossible to distribute the app, since no one would download a 2 GB installer. The only library I could find is a 40-year-old C library, written before advanced instruction sets even existed. (A rough sketch of what I'm after is shown right after these questions.)
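
To make question 3 concrete, here is the kind of thing I mean, sketched in Python under some assumptions: psutil for the CPU side and the NVML bindings (pip install nvidia-ml-py) for the GPU side, so only the Nvidia driver is needed, not the CUDA toolkit. The FLOPs-per-cycle factors still have to come from an architecture table, and nvmlDeviceGetNumGpuCores is only available with newer drivers / NVML versions:

```python
import psutil   # pip install psutil
import pynvml   # pip install nvidia-ml-py

# --- CPU: physical core count and nominal clock (MHz) ---
cpu_cores = psutil.cpu_count(logical=False)
freq = psutil.cpu_freq()                                    # can be None on some platforms
cpu_mhz = (freq.max or freq.current) if freq else 2800.0    # fall back to a known value
cpu_flops_per_cycle = 32                                    # assumption: AVX2 + 2x FMA (Kaby Lake)
print("CPU peak:", cpu_mhz / 1000 * cpu_cores * cpu_flops_per_cycle, "GFLOPS (FP32)")

# --- GPU: SM clock and CUDA core count via NVML ---
pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)
sm_mhz = pynvml.nvmlDeviceGetMaxClockInfo(gpu, pynvml.NVML_CLOCK_SM)
try:
    cuda_cores = pynvml.nvmlDeviceGetNumGpuCores(gpu)       # recent NVML only
except (pynvml.NVMLError, AttributeError):
    cuda_cores = 768                                        # fallback: known value for this card
print("GPU peak:", sm_mhz / 1000 * cuda_cores * 2, "GFLOPS (FP32, 2 FLOPs/cycle/core on Pascal)")
pynvml.nvmlShutdown()
```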

AlekseyHoffman
  • Hyperthreading (SMT) exists to keep the execution units in a physical core fed with work on more cycles, not to increase theoretical max throughput for "efficient" code that doesn't stall in the first place. http://www.lighterra.com/papers/modernmicroprocessors/ has a section on SMT. And see https://en.wikipedia.org/wiki/Simultaneous_multithreading. Agner Fog's x86 microarch guide (https://www.agner.org/optimize/) has much more detail, if you want to really understand how CPUs work in more detail than the first guide. – Peter Cordes Nov 17 '20 at 16:05
  • max FLOPs per clock cycle is not that hard to microbenchmark. [How do I achieve the theoretical maximum of 4 FLOPs per cycle?](https://stackoverflow.com/q/8389648) shows how for CPUs that predate FMA. It's not much different for CPUs with FMA, just create some code that compiles to something bound on FMA throughput (not latency), with few other instructions in the loop and no memory access. – Peter Cordes Nov 17 '20 at 22:59
  • @PeterCordes thanks for the suggestion, I'll look into it – AlekseyHoffman Nov 17 '20 at 23:06
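
Following up on the microbenchmark suggestion above: a crude empirical cross-check is to time a large single-precision matrix multiply through whatever BLAS NumPy links against. It is not a proper FMA-throughput loop like the linked answer, but on a well-tuned BLAS it should land within the same order of magnitude as the theoretical FP32 peak of the CPU:

```python
import time
import numpy as np

n = 4096
a = np.random.rand(n, n).astype(np.float32)
b = np.random.rand(n, n).astype(np.float32)

a @ b                              # warm-up run (thread pool, caches)
t0 = time.perf_counter()
a @ b
elapsed = time.perf_counter() - t0

flops = 2 * n**3                   # an n x n matmul performs ~2*n^3 FP operations
print(f"~{flops / elapsed / 1e9:.1f} GFLOPS achieved (FP32)")
```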

0 Answers