I am accelerating an MPI program with cuBLAS functions. To evaluate the application's efficiency, I want to know the FLOPS, memory usage, and other GPU statistics after the program has run, especially the FLOPS.

I have read the relevant question: How to calculate Gflops of a kernel. I think the answers give two ways to calculate the FLOPS of a program:

  1. The model count of an operation divided by the time the operation takes
  2. Using NVIDIA's profiling tools

The first solution doesn't depend on any tools, but I'm not sure what the model count means. Is it O(f(N))? For example, is the model count of GEMM O(N^3)? And if I multiply a 4 x 5 matrix by a 5 x 6 matrix and the elapsed time is 0.5 s, is the model count 4 x 5 x 6 = 120? So the FLOPS would be 120 / 0.5 = 240?

The second solution uses nvprof, which is now deprecated and replaced by Nsight Systems and Nsight Compute. But those two tools only work for CUDA programs, not for an MPI program that launches CUDA functions. So I am wondering whether there is a tool to profile a program that launches CUDA functions.

I have been searching for an answer for two days but still can't find an acceptable solution.

TherLF
  • For large matrices, I expect cuBLAS to use the Strassen algorithm, which is better than `O(n^3)`, as heavily optimized CPU implementations do that, but one needs to check the code. Besides, I also expect cuBLAS to use tensor cores on newer GPUs, which are certainly harder to profile. Finally, an MPI code generally load-balances the work quite evenly between nodes, so the FLOPS is that of one node multiplied by the number of nodes, assuming the nodes are homogeneous. Heterogeneous computing with MPI is quite crazy anyway. – Jérôme Richard Sep 03 '22 at 10:07
  • @JérômeRichard I would never expect a library to use Strassen without explicitly being told to. GEMM can be highly optimized by cache usage and such. – Victor Eijkhout Sep 03 '22 at 12:52
  • @JérômeRichard I also think it's impossible to get FLOPS without NVIDIA's official profiling tools. But the MPI program here doesn't load-balance the work. Because the matrices distributed to each node are different, the GEMM workload of each node differs accordingly. So I should use O(N^3) / time to get each GEMM's FLOPS? Thank you for your comment. – TherLF Sep 03 '22 at 14:37
  • "But those two tools only work for CUDA program, instead of MPI program launching CUDA function. " That's incorrect. – Robert Crovella Sep 03 '22 at 15:18
  • @RobertCrovella I am new to those two tools, and the documents and blogs I have seen only profile xxx.cu programs in the GUI. So maybe there is something I have missed. Thank you for the alert! I will read the official documentation carefully. – TherLF Sep 03 '22 at 16:43
  • @VictorEijkhout You are right. After a quick check, it looks like fewer BLAS implementations use Strassen than I thought (e.g. neither OpenBLAS nor BLIS). Although it can be a bit faster when carefully implemented, its numerical stability is not as good as the standard approach, which can be an issue in applications, so it makes sense to mention it indeed. – Jérôme Richard Sep 03 '22 at 23:37
  • @TherLF Regarding the load balancing, it is not very clear to me what is done, but if the matrices are relatively large then you can quite safely assume that the time is proportional to `N^3`. For small matrices, I think you cannot make this simplification since the size typically impacts performance (especially on GPUs). – Jérôme Richard Sep 03 '22 at 23:44
  • @JérômeRichard The BLIS project in fact has published a few papers on Strassen, but they only do a few steps at the top level, and then do traditional BLAS under that. I don't know if that's in the released software. Stability is indeed a big consideration, and I can imagine that cache & TLB usage is also less favorable. – Victor Eijkhout Sep 04 '22 at 00:07

2 Answers


But I'm not sure what the model count means. Is it O(f(N))? For example, is the model count of GEMM O(N^3)? And if I multiply a 4 x 5 matrix by a 5 x 6 matrix and the elapsed time is 0.5 s, is the model count 4 x 5 x 6 = 120? So the FLOPS would be 120 / 0.5 = 240?

The standard BLAS GEMM operation is C <- alpha * (A dot B) + beta * C. For A (m by k), B (k by n), and C (m by n), each inner product of a row of A and a column of B, multiplied by alpha, is 2 * k + 1 flop; there are m * n such inner products in A dot B, and adding beta * C to that product costs another 2 * m * n flop. So the total model flop count is (2 * k + 3) * (m * n) when alpha and beta are both non-zero.

For your example, assuming alpha = 1 and beta = 0 and an implementation smart enough to skip the unnecessary operations (and most are), the GEMM flop count is (2 * 5) * (4 * 6) = 240. If the execution time is 0.5 seconds, the model arithmetic throughput is 240 / 0.5 = 480 flop/s.

I would recommend using that approach if you really need to calculate the performance of GEMM (or other BLAS/LAPACK operations). This is how most of the computational linear algebra literature and benchmarking has worked since the 1970s, and how most reported results you will find are calculated, including the HPC LINPACK benchmark.
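For illustration, here is a minimal sketch (not part of the original answer) of how you could time a single cuBLAS SGEMM call with CUDA events and divide the model flop count by the measured time. The matrix sizes, the use of single precision, and the omission of error checking are assumptions made for brevity.

```cpp
// Minimal sketch: model GFLOP/s of one cuBLAS SGEMM call (sizes are arbitrary).
#include <cstdio>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int m = 4096, n = 4096, k = 4096;   // example sizes, not from the question
    const float alpha = 1.0f, beta = 0.0f;

    // Device buffers (contents left uninitialized; only timing matters here).
    float *A, *B, *C;
    cudaMalloc((void**)&A, sizeof(float) * m * k);
    cudaMalloc((void**)&B, sizeof(float) * k * n);
    cudaMalloc((void**)&C, sizeof(float) * m * n);

    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Warm-up call so library setup cost is not timed.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, A, m, B, k, &beta, C, m);

    cudaEventRecord(start);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, A, m, B, k, &beta, C, m);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    // Model flop count for alpha = 1, beta = 0: 2 * m * n * k, which is what the
    // (2 * k + 3) * (m * n) formula reduces to when the alpha/beta work is skipped.
    const double flops = 2.0 * m * n * k;
    printf("time = %f ms, model throughput = %f GFLOP/s\n",
           ms, flops / (ms * 1e-3) / 1e9);

    cublasDestroy(handle);
    cudaFree(A); cudaFree(B); cudaFree(C);
    return 0;
}
```

In an MPI setting you could compute this per rank around each GEMM call and reduce the totals, which matches the "model count divided by time" approach described above.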

talonmies
  • By the way, do you know the model count of trsm? Or could you please point me to a document on how to compute these operations' model counts? I have been searching for this but can't find an authoritative answer. – TherLF Sep 06 '22 at 13:12

The Nsight Systems documentation section Using the CLI to Analyze MPI Codes explains how to use nsys to collect runtime information for an MPI program.

And the GitLab project Roofline Model on NVIDIA GPUs uses ncu to collect the achieved FLOPS and memory traffic of the program. The methodology to compute these metrics is:

Time:

sm__cycles_elapsed.avg / sm__cycles_elapsed.avg.per_second

FLOPs:

DP: sm__sass_thread_inst_executed_op_dadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_dfma_pred_on.sum + sm__sass_thread_inst_executed_op_dmul_pred_on.sum

SP: sm__sass_thread_inst_executed_op_fadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_ffma_pred_on.sum + sm__sass_thread_inst_executed_op_fmul_pred_on.sum

HP: sm__sass_thread_inst_executed_op_hadd_pred_on.sum + 2 x sm__sass_thread_inst_executed_op_hfma_pred_on.sum + sm__sass_thread_inst_executed_op_hmul_pred_on.sum

Tensor Core: 512 x sm__inst_executed_pipe_tensor.sum

Bytes:

DRAM: dram__bytes.sum

L2: lts__t_bytes.sum

L1: l1tex__t_bytes.sum
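To make the arithmetic concrete, here is a minimal post-processing sketch (not from the linked project; all metric values below are placeholders, not measurements) that combines these sums into a time, an achieved FLOP rate, and an arithmetic intensity for the roofline plot:

```cpp
// Minimal sketch: combining ncu metric sums into time, FLOP/s, and arithmetic
// intensity. All numeric values below are placeholders, not real measurements.
#include <cstdio>

int main() {
    // sm__cycles_elapsed.avg and sm__cycles_elapsed.avg.per_second (placeholders)
    double cycles_elapsed    = 1.2e9;
    double cycles_per_second = 1.4e9;

    // sm__sass_thread_inst_executed_op_d{add,fma,mul}_pred_on.sum (placeholders)
    double dadd = 1.0e9, dfma = 5.0e9, dmul = 2.0e9;

    // dram__bytes.sum (placeholder)
    double dram_bytes = 3.0e10;

    double time_s  = cycles_elapsed / cycles_per_second; // kernel time in seconds
    double dp_flop = dadd + 2.0 * dfma + dmul;           // an FMA counts as 2 flop
    double gflops  = dp_flop / time_s / 1e9;             // achieved double-precision GFLOP/s
    double ai      = dp_flop / dram_bytes;               // flop per DRAM byte (roofline x-axis)

    printf("time = %g s, DP rate = %g GFLOP/s, arithmetic intensity = %g flop/byte\n",
           time_s, gflops, ai);
    return 0;
}
```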

TherLF