
I have OpenCL code that multiplies two matrices (GEMM) with M=4096, N=4096 and K=16 (i.e. the input matrices are 4096 x 16 floats).

I run it on a Polaris 560, a 16 CU GPU.

Code: https://github.com/artyom-beilis/oclblas/blob/master/gemm/gemm.cl

I noticed very strange performance drops for this size: matrix multiplication runs at ~8-10 GFlops, while if I change N to 4095 or 4097 I get around 130-150 GFlops. I noticed similar behaviour with other GEMM libraries like clBLAS or MIOpenGEMM: there is a significant performance drop for this particular 4096x16 size, and changing N by 1 boosts performance several times.

The workload is split into work-groups of 256 threads. Each work-group handles a 128x16 tile of A and a 16x128 tile of B, i.e. a 128x128 tile of C (an 8x8 block per thread).
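For reference, a simplified sketch of this tiling scheme (not the actual linked gemm.cl, which presumably also stages tiles through local memory; the row-major layout and the assumption that M and N are multiples of 128 are just for illustration):

    // Simplified illustration of the tiling, assuming row-major storage and
    // M, N multiples of 128. Launch with local size (16, 16) and global size
    // (N / 8, M / 8), so each 16x16 work-group owns a 128x128 tile of C.
    __kernel void gemm_tile_sketch(int M, int N, int K,
                                   __global const float *A,  // M x K
                                   __global const float *B,  // K x N
                                   __global float *C)        // M x N
    {
        const int row0 = get_global_id(1) * 8;  // top row of this thread's 8x8 block
        const int col0 = get_global_id(0) * 8;  // left column of this thread's 8x8 block

        float acc[8][8] = {{0.0f}};
        for (int k = 0; k < K; ++k) {           // K is small here (16)
            float a[8], b[8];
            for (int i = 0; i < 8; ++i) a[i] = A[(row0 + i) * K + k];
            for (int j = 0; j < 8; ++j) b[j] = B[k * N + (col0 + j)];
            for (int i = 0; i < 8; ++i)
                for (int j = 0; j < 8; ++j)
                    acc[i][j] += a[i] * b[j];
        }
        for (int i = 0; i < 8; ++i)
            for (int j = 0; j < 8; ++j)
                C[(row0 + i) * N + (col0 + j)] = acc[i][j];
    }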

I tried changing the tiling to 96x96 tiles with 6x6 blocks instead of 128x128 with 8x8 - same result.

I tested the same code with ROCm 3.7 OpenCL, Clover OpenCL and even with the Windows OpenCL driver - same behaviour.

There is no such issue with an NVIDIA GTX 960, which has the same number of GPU cores (threads) and the same memory type/size.

I suspect that this is somehow cache/collision related, but I don't understand how it happens, so I don't know how to work around it.
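To make the suspicion concrete: with N=4096, walking down a column of B or C touches addresses exactly 16 KiB apart, so they may repeatedly collide in the cache set / memory channel mapping (the exact mapping is hardware-specific; this is only my guess):

    #include <stdio.h>

    int main(void)
    {
        const long N_bad = 4096, N_good = 4097;  /* leading dimension in floats */
        printf("row pitch for N=4096: %ld bytes\n", N_bad  * (long)sizeof(float));  /* 16384 */
        printf("row pitch for N=4097: %ld bytes\n", N_good * (long)sizeof(float));  /* 16388 */
        /* 16384 is a large power of two, so successive rows of a column keep
         * hitting the same offsets within the cache / channel interleave;
         * 16388 spreads them out, which would match the observed speedup
         * when N changes by 1. */
        return 0;
    }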

  • I did quite a bit of research on different aspects of matrix multiplication. This Stack Overflow question could put you on the right path: https://stackoverflow.com/questions/31439512/opencl-matrix-multiplication-speed Check it out and let me know what you think; hopefully we can work out a proper solution. I have noticed that ATI/AMD graphics cards are not as optimized as NVIDIA's, with framebuffer issues and locked threads that lead to GFlops drops. – Ole Pannier Jul 01 '21 at 22:01
  • @DEX7RA the general concept of efficient GEMM is already implemented; it is more specifically some kind of memory-access-pattern / cache problem that I can't understand. – Artyom Jul 02 '21 at 19:42
  • I see, but I found some well-regarded research showing that GFlops drops are normal as long as they are not consistent and only happen irregularly. That would explain why they appear with smaller and even with larger matrices. – Ole Pannier Jul 03 '21 at 18:44
  • Yes, but the problem here is a performance drop by an order of magnitude. – Artyom Jul 05 '21 at 19:01

1 Answer


Finally, I found that the clBLAS library (originally developed for AMD) handles the special case of lda % 1024 == 0, ldb % 1024 == 0, probably because of cache behaviour:

https://github.com/clMathLibraries/clBLAS/blob/master/src/library/blas/specialCases/GemmSpecialCases.cpp#L228
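For illustration only (this is not the clBLAS code behind the link, and padding is a different workaround than the one I used below): one common way to break this kind of alignment is to pad the leading dimension whenever it is a multiple of 1024 floats:

    /* Hypothetical host-side helper: if the leading dimension is a multiple of
     * 1024 floats (4 KiB), bump it by a small amount so consecutive rows stop
     * mapping to the same cache sets / channels. The padding amount is arbitrary. */
    static int pad_leading_dim(int ld)
    {
        return (ld % 1024 == 0) ? ld + 16 : ld;
    }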

I found that a better way was to rearrange the blocks in z-curve order instead of queuing several kernels.

https://github.com/artyom-beilis/oclblas/blob/master/gemm/gemm.cl#L109
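The idea, roughly (a minimal sketch, not the exact code behind the link): deinterleave the bits of the linear work-group index into z-ordered tile coordinates, so that consecutive work-groups touch neighbouring tiles of C in both dimensions:

    /* Map a linear group index to (tile_row, tile_col) along a z-curve (Morton
     * order) by splitting its odd and even bits. Assumes the tile grid has been
     * rounded up to a power-of-two size. */
    uint2 zorder_tile(uint g)
    {
        uint2 rc = (uint2)(0u, 0u);
        for (uint b = 0; b < 16; ++b) {
            rc.y |= ((g >> (2 * b))     & 1u) << b;  /* even bits -> tile column */
            rc.x |= ((g >> (2 * b + 1)) & 1u) << b;  /* odd bits  -> tile row    */
        }
        return rc;
    }

Inside the kernel this mapping replaces the usual get_group_id(0)/get_group_id(1) pair when selecting the output tile.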

To handle the cases M != N or M != 1<<n, I just round the number of work-groups along M/N up to the nearest 1<<n; groups that have no job exit at the very beginning, which does not add much overhead.
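On the host side this amounts to something like the following (hypothetical helper names):

    #include <stddef.h>

    /* smallest power of two >= x */
    static size_t next_pow2(size_t x)
    {
        size_t p = 1;
        while (p < x)
            p <<= 1;
        return p;
    }

    /* e.g. M = 3000 with 128x128 tiles needs 24 tiles along M; this is rounded
     * up to 32, and the 8 extra rows of groups exit at the start of the kernel. */
    static size_t gemm_group_count(size_t M, size_t N, size_t tile)
    {
        size_t tiles_m = next_pow2((M + tile - 1) / tile);
        size_t tiles_n = next_pow2((N + tile - 1) / tile);
        return tiles_m * tiles_n;
    }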

The z-order traversal improved performance about 4x.

  • Wow, cool that you found that! Sorry I didn't reply earlier, but I had no clue any more and was still researching. You were faster. Glad it works! – Ole Pannier Jul 13 '21 at 13:03