I've implemented the VGG19 network in C++ using SIMD instructions, for inference only. I want to optimize the latency of a single inference request.
Since VGG19 consists mostly of convolution layers, I mainly focused on implementing an efficient convolution layer, following this paper: Anatomy Of High-Performance Deep Learning Convolutions On SIMD Architectures.
My implementation delivers correct results. I use SIMD intrinsics and the algorithm described in the paper. All weights are loaded beforehand, and the input and output buffers of each layer are allocated before running the actual inference.
As an example, let's look at the second convolution layer of the VGG19 net:
- Input: (224, 224, 64) (226, 226, 64 after Padding)
- Output: (224, 224, 64)
- Kernel: (3, 3, 64, 64) (KH, KW, C_IN, C_OUT)
Here is the corresponding code:
void conv2d_block1_conv2(const float* in, const float* weights, float* out) {
    constexpr int VLEN = 8;      // vector length of the _mm256_* intrinsics
    constexpr int C_OUT_B = VLEN;
    constexpr int C_IN_B = VLEN;
    constexpr int H = 226;       // input height (after padding)
    constexpr int W = 226;       // input width (after padding)
    constexpr int C_IN = 64;     // input channels
    constexpr int KH = 3;        // kernel height
    constexpr int KW = 3;        // kernel width
    constexpr int H_OUT = 224;   // output height
    constexpr int W_OUT = 224;   // output width
    constexpr int C_OUT = 64;    // output channels

    __m256 in_vec, weights_vec, out_vec;
    for (int c_out = 0; c_out < C_OUT / C_OUT_B; c_out++)
        for (int c_in_b = 0; c_in_b < C_IN / C_IN_B; c_in_b++)
            for (int h_out = 0; h_out < H_OUT; h_out++)
                for (int w_out = 0; w_out < W_OUT; w_out++) {
                    const int outIdx = LINEAR_4(c_out, h_out, w_out, 0, H_OUT, W_OUT, C_OUT_B);
                    out_vec = _mm256_load_ps(&out[outIdx]);
                    for (int kh = 0; kh < KH; kh++)
                        for (int kw = 0; kw < KW; kw++)
                            for (int c_in = 0; c_in < C_IN_B; c_in++) {
                                const int inIdx = LINEAR_4(c_in_b, h_out + kh, w_out + kw, c_in, H, W, C_IN_B);
                                const int weightsIdx = LINEAR_6(c_out, c_in_b, kh, kw, c_in, 0, C_IN / C_IN_B, KH, KW, C_IN_B, C_OUT_B);
                                in_vec = _mm256_set1_ps(in[inIdx]);
                                weights_vec = _mm256_load_ps(&weights[weightsIdx]);
                                out_vec = _mm256_fmadd_ps(in_vec, weights_vec, out_vec);
                                _mm256_store_ps(&out[outIdx], out_vec);
                            }
                }
}
Note: I'm working in a linear address space. The functions LINEAR_4 and LINEAR_6 map the multidimensional indices to a one-dimensional one:
array[c_out][h_out][w_out][0] <-> LINEAR_4(c_out, h_out, w_out, 0, H_OUT, W_OUT, C_OUT_B);
array[c_out][c_in_b][kh][kw][c_in][0] <-> LINEAR_6(c_out, c_in_b, kh, kw, c_in, 0, C_IN / C_IN_B, KH, KW, C_IN_B, C_OUT_B);
I created a function like the one above for every convolution layer, so that all dimensions are compile-time constants and the compiler gets the best optimization opportunities.
However, the execution time is fairly bad. For the whole VGG19 net (both single-threaded):
- My implementation: 2400 ms
- Keras with TensorFlow backend, using model.predict(image): 600 ms
This huge performance gap makes me wonder what I'm doing wrong. I'm compiling with clang and the -O3 flag.
So my questions are:
- Are there key factors I didn't take into account?
- Which implementation do Keras/TensorFlow use, and how are they so fast?