I want to vectorize the following loop on ARM NEON and SSE:
    for (int i = 0; i < n; ++i) {
        b[i][0] = 0.0;
        for (int j = 1; j < n; ++j) {
            b[i][j] = b[i][j - 1] + a[i][j];
        }
    }
The inner loop has a loop-carried dependency (b[i][j] depends on b[i][j - 1]), so it cannot be vectorized directly along j. The rows are independent of each other, however, so the loop can be vectorized using outer-loop vectorization.
Assuming that a and b are float32, outer-loop vectorization works by running 4 instances of the inner loop in parallel, one per vector lane. Here is what the iterations of the vectorized loop look like:
First iteration: { { i = 0, j = 1 }, { i = 1, j = 1 }, { i = 2, j = 1 }, { i = 3, j = 1 } }
Second iteration: { { i = 0, j = 2 }, { i = 1, j = 2 }, { i = 2, j = 2 }, { i = 3, j = 2 } }
Etc.
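To make the access pattern concrete, here is a scalar C sketch of the outer-loop form. It is only an illustration under some assumptions that are not in the loop above: a and b are stored as flat row-major n x n float arrays, and the names prefix_rows_scalar, prefix_rows_outer and LANES are made up for this example. The array acc stands in for one 4-lane vector register.

```c
#include <stddef.h>

#define LANES 4 /* assumed: one 128-bit NEON/SSE register holds 4 floats */

/* Reference version: the original loop on flat row-major arrays. */
void prefix_rows_scalar(const float *a, float *b, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        b[i * n] = 0.0f;
        for (size_t j = 1; j < n; ++j)
            b[i * n + j] = b[i * n + j - 1] + a[i * n + j];
    }
}

/* Outer-loop form: LANES rows advance through j in lockstep. */
void prefix_rows_outer(const float *a, float *b, size_t n)
{
    size_t i = 0;

    for (; i + LANES <= n; i += LANES) {
        float acc[LANES] = {0.0f}; /* running sums, one per row/lane */

        for (size_t l = 0; l < LANES; ++l)
            b[(i + l) * n] = 0.0f;

        for (size_t j = 1; j < n; ++j) {
            for (size_t l = 0; l < LANES; ++l) {
                /* Consecutive lanes are n floats apart in memory:
                   this is the strided load a gather would do. */
                acc[l] += a[(i + l) * n + j];
                /* Same stride on the store side (a scatter). */
                b[(i + l) * n + j] = acc[l];
            }
        }
    }

    /* Leftover rows when n is not a multiple of LANES. */
    for (; i < n; ++i) {
        b[i * n] = 0.0f;
        for (size_t j = 1; j < n; ++j)
            b[i * n + j] = b[i * n + j - 1] + a[i * n + j];
    }
}
```

The loop over l is what would become a single vector operation. The addition maps directly onto a 4-lane add, but the load and the store each touch 4 addresses that are n floats apart, which is exactly where a gather and a scatter would be needed.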
I want to vectorize this loop using NEON and SSE. Neither of them supports gather or scatter instructions, which a straightforward outer-loop vectorization needs, because the 4 lanes read and write elements from 4 different rows. Do you have any idea how to vectorize this loop efficiently?