
I've implemented the VGG19 net in C++ using SIMD instructions, for inference only. I want to optimize the latency of a single inference request.

Since VGG19 consists mostly of convolution layers, I mainly focused on implementing an efficient convolution layer. I followed this paper while doing it: Anatomy of High-Performance Deep Learning Convolutions on SIMD Architectures.

My implementation delivers correct results. I use SIMD intrinsics and the algorithm described in the paper. All weights are loaded beforehand. The input and output buffers of each layer are allocated before running the actual inference.
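
Since _mm256_load_ps / _mm256_store_ps expect 32-byte aligned addresses, the pre-allocated buffers need to be aligned accordingly. A minimal sketch of such an allocation (illustrative only; alloc_tensor is a placeholder name, not my actual code):

#include <cstdlib>   // std::aligned_alloc, std::free
#include <new>       // std::bad_alloc

// Illustrative helper: allocates a float buffer aligned for _mm256_load_ps/_mm256_store_ps.
// std::aligned_alloc requires the size to be a multiple of the alignment, hence the rounding.
float* alloc_tensor(std::size_t elems) {
    const std::size_t bytes = ((elems * sizeof(float) + 31) / 32) * 32;
    void* p = std::aligned_alloc(32, bytes);
    if (!p) throw std::bad_alloc{};
    return static_cast<float*>(p);   // release later with std::free
}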


As an example, let's look at the second convolution layer of the VGG19 net:

  • Input: (224, 224, 64) (226, 226, 64 after Padding)
  • Output: (224, 224, 64)
  • Kernel: (3, 3, 64, 64) (KH, KW, C_IN, C_OUT)
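
For a sense of scale, here is a rough back-of-the-envelope operation count for this layer (my own arithmetic, not a measured number):

// Rough cost of this single layer, counting an FMA as two FLOPs:
constexpr long long MACS = 224LL * 224 * 64   // output pixels * output channels
                         * 3 * 3 * 64;        // kernel taps * input channels
// MACS == 1849688064, i.e. about 1.85 billion FMAs, or roughly 3.7 GFLOP.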

Here is the corresponding code:

void conv2d_block1_conv2(const float* in, const float* weights, float* out) {
    constexpr int VLEN = 8; // to use _mm256_* intrinsics
    constexpr int C_OUT_B = VLEN;
    constexpr int C_IN_B = VLEN;

    constexpr int H = 226;           // Input Height
    constexpr int W = 226;           // Input Width
    constexpr int C_IN = 64;         // Input Channels

    constexpr int KH = 3;            // Kernel Height
    constexpr int KW = 3;            // Kernel Width

    constexpr int H_OUT = 224;       // Output Height
    constexpr int W_OUT = 224;       // Output Width
    constexpr int C_OUT = 64;        // Output Channels

    __m256 in_vec, weights_vec, out_vec;
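    // One __m256 accumulator (out_vec) covers a block of 8 output channels;
    // for each output pixel the KH * KW * C_IN_B reduction is accumulated into it.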
    for (int c_out = 0; c_out < C_OUT / C_OUT_B; c_out++)
    for (int c_in_b = 0; c_in_b < C_IN / C_IN_B; c_in_b++)
    for (int h_out = 0; h_out < H_OUT; h_out++)
    for (int w_out = 0; w_out < W_OUT; w_out++){
        const int outIdx = LINEAR_4(c_out, h_out, w_out, 0, H_OUT, W_OUT, C_OUT_B);
        out_vec = _mm256_load_ps (&out[outIdx]);
        for (int kh = 0; kh < KH; kh++)
            for (int kw = 0; kw < KW; kw++)
                for (int c_in = 0; c_in < C_IN_B; c_in++){
                    const int inIdx = LINEAR_4(c_in_b, h_out + kh, w_out + kw, c_in, H, W, C_IN_B);
                    const int weightsIdx = LINEAR_6(c_out, c_in_b, kh, kw, c_in, 0, C_IN / C_IN_B, KH, KW, C_IN_B, C_OUT_B);
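                    // Broadcast one input value and FMA it into the 8 output channels of this block.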
                    in_vec = _mm256_set1_ps (in[inIdx]);
                    weights_vec = _mm256_load_ps(&weights[weightsIdx]); 
                    out_vec = _mm256_fmadd_ps (in_vec, weights_vec, out_vec);
                    _mm256_store_ps(&out[outIdx], out_vec);
                }
    }
}

Note: I'm working on a linear address space. The functions LINEAR_4 and LINEAR_6 map the multidimensional indices to a one-dimensional one.

array[c_out][h_out][w_out][0]         <-> LINEAR_4(c_out, h_out, w_out, 0, H_OUT, W_OUT, C_OUT_B); 
array[c_out][c_in_b][kh][kw][c_in][0] <-> LINEAR_6(c_out, c_in_b, kh, kw, c_in, 0, C_IN / C_IN_B, KH, KW, C_IN_B, C_OUT_B);
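
A row-major definition consistent with these mappings would look like this (sketch only; the trailing arguments are the extents of all but the first dimension):

constexpr int LINEAR_4(int i0, int i1, int i2, int i3,
                       int d1, int d2, int d3) {
    // Row-major flattening over a 4-D array with inner extents d1, d2, d3.
    return ((i0 * d1 + i1) * d2 + i2) * d3 + i3;
}

constexpr int LINEAR_6(int i0, int i1, int i2, int i3, int i4, int i5,
                       int d1, int d2, int d3, int d4, int d5) {
    // Row-major flattening over a 6-D array with inner extents d1..d5.
    return ((((i0 * d1 + i1) * d2 + i2) * d3 + i3) * d4 + i4) * d5 + i5;
}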

I created a function like the one above for every convolution layer, to give the compiler the best optimization opportunities.

However, the execution time is fairly bad. For the whole VGG19 net (both single-threaded):

  • My implementation: 2400ms
  • Keras with Tensorflow Backend using model.predict(image): 600ms

This huge performance gap makes me wonder what I'm doing wrong. I'm compiling with clang and the -O3 flag.

So my questions are:

  1. Are there key factors I didn't take into account?
  2. Which implementation is Keras/TensorFlow using, and how is it so fast?
sp1etz
  • TensorFlow works on multiple threads, and there seems to be little you can do to prevent that ([see this question](https://stackoverflow.com/q/60206113/1782792))... – jdehesa Mar 06 '20 at 18:09
  • I think I successfully limited TensorFlow to use only one core with the code from this [question](https://stackoverflow.com/questions/51032845/single-thread-impacts-model-accuracy-and-loss-with-tensorflow-keras-backend). My task manager at least doesn't show load on the other cores. – sp1etz Mar 06 '20 at 18:40
  • [TensorFlow](https://github.com/tensorflow/tensorflow) is open source, so why not dig in yourself to find out what implementation is being used? – doqtor Mar 07 '20 at 12:34

1 Answer


I found the reason for the poor performance: the clang compiler only used 2 SSE registers instead of all the available ones. This led to unnecessary writes to and reads from the L1 cache.

I unrolled the two inner loops by hand, and the compiler now uses all 16 available SSE registers. The performance increased drastically.
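
For illustration, one way to restructure the loop so that several independent accumulators stay in YMM registers and each output vector is stored only once (a sketch of the idea, not necessarily the exact unrolling I used; it reuses the constants and the LINEAR_4/LINEAR_6 helpers from the question):

#include <immintrin.h>

void conv2d_block1_conv2_unrolled(const float* in, const float* weights, float* out) {
    constexpr int VLEN = 8, C_OUT_B = VLEN, C_IN_B = VLEN;
    constexpr int H = 226, W = 226, C_IN = 64;
    constexpr int KH = 3, KW = 3;
    constexpr int H_OUT = 224, W_OUT = 224, C_OUT = 64;
    constexpr int W_OUT_B = 4;   // 4 output pixels per iteration (224 % 4 == 0)

    for (int c_out = 0; c_out < C_OUT / C_OUT_B; c_out++)
    for (int c_in_b = 0; c_in_b < C_IN / C_IN_B; c_in_b++)
    for (int h_out = 0; h_out < H_OUT; h_out++)
    for (int w_out = 0; w_out < W_OUT; w_out += W_OUT_B) {
        // Four independent accumulators -> four dependency chains,
        // all of which can stay in YMM registers for the whole reduction.
        __m256 acc[W_OUT_B];
        for (int i = 0; i < W_OUT_B; i++)
            acc[i] = _mm256_load_ps(&out[LINEAR_4(c_out, h_out, w_out + i, 0, H_OUT, W_OUT, C_OUT_B)]);

        for (int kh = 0; kh < KH; kh++)
        for (int kw = 0; kw < KW; kw++)
        for (int c_in = 0; c_in < C_IN_B; c_in++) {
            // One weight vector is reused for all four output pixels.
            const __m256 w_vec = _mm256_load_ps(&weights[
                LINEAR_6(c_out, c_in_b, kh, kw, c_in, 0, C_IN / C_IN_B, KH, KW, C_IN_B, C_OUT_B)]);
            for (int i = 0; i < W_OUT_B; i++) {
                const __m256 i_vec = _mm256_set1_ps(
                    in[LINEAR_4(c_in_b, h_out + kh, w_out + kw + i, c_in, H, W, C_IN_B)]);
                acc[i] = _mm256_fmadd_ps(i_vec, w_vec, acc[i]);
            }
        }

        // Store each output vector once, after the full reduction.
        for (int i = 0; i < W_OUT_B; i++)
            _mm256_store_ps(&out[LINEAR_4(c_out, h_out, w_out + i, 0, H_OUT, W_OUT, C_OUT_B)], acc[i]);
    }
}

With the short fixed-trip-count i loops fully unrolled by the compiler, the four accumulators plus the weight and broadcast vectors fit comfortably in the 16 YMM registers.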

If you work with SSE intrinsics, make sure to check the generated assembly.

sp1etz
  • Without `-ffast-math`, clang *couldn't* change your loop to use multiple accumulators even if it wanted to; FP addition / FMA is not strictly associative so you'd get different rounding from doing one chain of FMAs vs. distributing across multiple dependency chains to hide FP latency. When auto-vectorizing with `-ffast-math`, clang usually *will* unroll with 4 dependency chains. (Although that's not enough to hide FMA latency if you can get 2/clock FMAs) And BTW, `__m256` is an AVX 256-bit vector that can live in a YMM register, wider than an SSE XMM register. – Peter Cordes Mar 12 '20 at 19:47
  • Related: [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) is about performance details of unrolling with multiple accumulators. – Peter Cordes Mar 12 '20 at 19:49
  • Many thanks for these details! This makes perfect sense. However, I had actually already set this flag, and the compiler only used 2 YMM registers. I also observed that the compiler was able to generate better code when using regular multidimensional arrays instead of doing the index calculation on my own. – sp1etz Mar 13 '20 at 08:57
  • IIRC, clang won't do extra unrolling on manually-vectorized loops even with `-ffast-math` (or for integer SIMD). It does unroll by 4 when *auto*-vectorizing a tiny loop, though. (Or by 2 for a small loop). Associativity is necessary to vectorize at all, of course. (Fun fact: GCC only enables `-funroll-loops` with profile-guided optimization, unlike clang) – Peter Cordes Mar 13 '20 at 09:20