I am trying to build a Halide-based image processing algorithm that requires an SGEMM function at one of its stages.
I have found that Halide has two matrix multiplication implementations:
- linear algebra algorithms (apps/linear_algebra folder)
- CUDA matrix multiplication application (apps/cuda_mat_mul folder)
For matrices of size 1024x1024:
The first works quite well on CPU (Intel i7) and on a Fermi GPU (GF 540M): CPU time is close to OpenBLAS and Fermi GPU time is close to cuBLAS (about 18 ms). However, the same implementation runs 10x slower than cuBLAS on a Maxwell GPU (Titan X): 5 ms vs 0.4 ms. The second implementation (cuda_mat_mul) is 3x slower than cuBLAS on Fermi (about 57 ms vs 18 ms) and 2x slower than cuBLAS on Maxwell (1 ms vs 0.4 ms).
As far as I can see, Halide can generate near-optimal code for Fermi GPUs but is unable to run fast on Maxwell. I understand that SGEMM is essentially many fused multiply-adds with the right scheduling, but I can't find any schedule that makes it run fast on Maxwell.
The fastest Halide code I can come up with is the one in the cuda_mat_mul folder, with this schedule:
// x, y are Vars; A, B are the input matrices; size is the matrix dimension
// (all assumed declared earlier, as in the cuda_mat_mul app).
Func prod("prod");
RDom r(0, size);
prod(x, y) += A(x, r) * B(r, y);

Var xi, yi, xio, xii, yii, xo;
Func out = prod.in();
// Tile the output and map tiles to GPU blocks/threads.
out.bound(x, 0, size)
   .bound(y, 0, size)
   .tile(x, y, xi, yi, 8 * 32, 8)
   .split(xi, xio, xii, 32)
   .reorder(xio, yi, xii, x, y)
   .unroll(xio)
   .unroll(yi)
   .gpu_blocks(x, y)
   .gpu_threads(xii);
// Accumulate each thread's 8x8 sub-block in registers.
prod.compute_at(out, xii)
    .unroll(x)
    .unroll(y)
    .update()
    .unroll(r.x, 2)
    .reorder(y, x, r.x)
    .unroll(x)
    .unroll(y);
// Stage B, vectorized along its first dimension.
B.in()
 .compute_at(prod, y)
 .vectorize(B.in().args()[0]);
I have also tried larger matrices (2048x2048) and the picture looks similar (times in seconds):
- cuBLAS time: 0.003174
- Halide linalg SGEMM time: 0.042568
- Halide cuda_mat_mul time: 0.006792
The benchmarking code comes from apps/cuda_mat_mul/runner.cpp, with the iteration count raised from 10 to 100 for more precise timings.
How should I change the schedule to get performance close to cuBLAS on the Titan X?
Update: tested on Ubuntu 16.04, LLVM 3.8, latest Halide from git, CUDA 8.