I was inspired by this question and am wondering whether it is possible to execute multiple SIMD instructions at the same time, since a CPU core may have multiple vector processing units (page 5 of these slides).
The code is:
#include <algorithm>
#include <cstdlib>   // std::rand
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];
    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    long long sum = 0;
    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    // Print the result so the compiler cannot discard the loops as dead code
    std::cout << "sum = " << sum << std::endl;
    return 0;
}
The compiled assembly code: compiled for AVX512 and compiled for AVX2.
After inspecting the assembly code, I found that the inner loop (the array traversal) was vectorized. In the AVX512 case (-march=knl, Knights Landing), each iteration of the vectorized loop processes 64 elements using 8 SIMD additions, each adding 8 elements to the previous result.
The intermediate results are stored in 4 zmm registers, each holding 8 elements. Finally, the 4 zmm registers are reduced to a single result, sum. It seems these SIMD instructions are executed serially, because they all use the same zmm5 register to hold the intermediate value.
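To make the structure described above concrete, here is a minimal scalar sketch in C++ of the same idea: several independent running sums that are combined only at the end. It is not the compiler's output. The function names sum_single and sum_four_accumulators and the variables acc0 to acc3 are hypothetical; each accumulator stands in for one of zmm1 to zmm4, reduced to a single lane.

```cpp
#include <cstddef>

// Reference version: the inner loop from the question, with one accumulator.
long long sum_single(const int* data, std::size_t n)
{
    long long sum = 0;
    for (std::size_t c = 0; c < n; ++c)
        if (data[c] >= 128)
            sum += data[c];
    return sum;
}

// Same result, but with four independent accumulators (hypothetical names).
// acc0..acc3 play the role of zmm1..zmm4 in the assembly, one lane each.
long long sum_four_accumulators(const int* data, std::size_t n)
{
    long long acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;
    std::size_t c = 0;

    // Main loop: four elements per step. No accumulator reads another one,
    // so the four additions form four separate dependency chains.
    for (; c + 4 <= n; c += 4) {
        acc0 += (data[c]     >= 128) ? data[c]     : 0;
        acc1 += (data[c + 1] >= 128) ? data[c + 1] : 0;
        acc2 += (data[c + 2] >= 128) ? data[c + 2] : 0;
        acc3 += (data[c + 3] >= 128) ? data[c + 3] : 0;
    }

    // Final reduction, analogous to combining the 4 zmm registers into sum.
    long long sum = (acc0 + acc1) + (acc2 + acc3);

    // Tail: leftover elements when n is not a multiple of 4.
    for (; c < n; ++c)
        if (data[c] >= 128)
            sum += data[c];

    return sum;
}
```

Calling sum_four_accumulators(data, arraySize) in place of the inner loop should give the same sum as sum_single. In this sketch the only values carried from one step to the next are the accumulators themselves; the conditional temporaries are recomputed each time, much like the value placed in zmm5.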
A piece of the assembly:
# 4 SIMD
vpmovzxdq zmm5, ymm5 # extends 8 elements from int (32) to long long (64)
vpaddq zmm1, zmm1, zmm5 # add to the previous result
vpmovzxdq zmm5, ymm6 # They are using the same zmm5 register
vpaddq zmm2, zmm2, zmm5 # so I think they are not parallelized
vpmovzxdq zmm5, ymm7
vpaddq zmm3, zmm3, zmm5
vpmovzxdq zmm5, ymm8
vpaddq zmm4, zmm4, zmm5
# intermediate result stored in zmm1~zmm4
# read additional 32 elements and repeat the above routine once
# in total 8 SIMD and 64 elements in each FOR step after compilation
My question is: according to Intel, each Knights Landing core has 2 vector processing units (page 5 of these slides). Would it therefore be possible to execute 2 AVX512 SIMD computations at the same time?