
I am trying to optimize the performance of the following naive program without changing the algorithm:

void naive(int n, const int *a, const int *b, int *c)
// a, b are two arrays of size n; c must be zero-initialized
{
  for (int k = 0; k < n; k++)
    for (int i = 0; i < n - k; ++i)
      c[k] += a[i + k] * b[i];
}

My idea is as follows: first, I use OpenMP for the outer loop. For the inner loop, since its length is imbalanced across iterations, I check n - k to decide whether to use AVX2 SIMD intrinsics or a simple scalar reduction. With this, I measure a speedup of about 40 when n approaches 3e7.

Are there any suggestions that could make it run faster?

My code is as follows:

#include <immintrin.h>
#include <omp.h>

static int n_zero = 0;
static int MAX_CORE = omp_get_max_threads(); // non-constant initializer: needs C++ (for C, move this into naive())
void naive(int n, const int *a, const int *b, int *c)
{
    omp_set_num_threads(MAX_CORE);

#pragma omp parallel for schedule(dynamic)
    for (int k = 0; k < n; k++)
    {
        if ((n - k) < MAX_CORE)
        {
            for (int i = 0; i < (n - k); i++)
            {
                c[k] += a[i + k] * b[i];
            }
        }
        else
        {
            __m256i partial_sums = _mm256_set1_epi32(n_zero);
            for (int i = 0; i < (n - k) / 32 * 32; i += 32)
            {

                __m256i vec_a_1 = _mm256_loadu_si256((__m256i *)(a + i + k));
                __m256i vec_b_1 = _mm256_loadu_si256((__m256i *)(b + i));
                __m256i partial_pd = _mm256_mullo_epi32(vec_a_1, vec_b_1);
                partial_sums = _mm256_add_epi32(partial_pd, partial_sums);

                vec_a_1 = _mm256_loadu_si256((__m256i *)(a + i + k + 8));
                vec_b_1 = _mm256_loadu_si256((__m256i *)(b + i + 8));
                partial_pd = _mm256_mullo_epi32(vec_a_1, vec_b_1);
                partial_sums = _mm256_add_epi32(partial_pd, partial_sums);

                vec_a_1 = _mm256_loadu_si256((__m256i *)(a + i + k + 16));
                vec_b_1 = _mm256_loadu_si256((__m256i *)(b + i + 16));
                partial_pd = _mm256_mullo_epi32(vec_a_1, vec_b_1);
                partial_sums = _mm256_add_epi32(partial_pd, partial_sums);

                vec_a_1 = _mm256_loadu_si256((__m256i *)(a + i + k + 24));
                vec_b_1 = _mm256_loadu_si256((__m256i *)(b + i + 24));

                partial_pd = _mm256_mullo_epi32(vec_a_1, vec_b_1);
                partial_sums = _mm256_add_epi32(partial_pd, partial_sums);
            }

            int arr[] = {0, 0, 0, 0, 0, 0, 0, 0};
            _mm256_storeu_si256(((__m256i *)arr), partial_sums);
            for (int i = 0; i < 8; i++)
            {
                c[k] += arr[i];
            }

            for (int i = (n - k) / 32 * 32 + k; i < n; i++)
            {
                c[k] += a[i] * b[i - k];
            }
        }
    }
}

anmo
    You get a "speedup of 400" on what kind of machine? – EOF May 28 '20 at 05:44
    See [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/q/6996764) for less naive ways to do the horizontal sum outside the inner loop. Even better if you can arrange the vectorization so you're doing 8 `k` values in parallel and don't need to horizontal sum after the inner loop. Maybe with broadcast loads for each new `i` value, reducing the cache / memory bandwidth required? – Peter Cordes May 28 '20 at 06:58
  • First of all, I really suggest using something like `uint32_t*` instead of `int*` (The size of `int` is not specified (although it will usually be 32 bit on x86/x86_64) and the behavior of signed integer overflow is not defined either (it will often wrap around as expected, but only for unsigned integers that behavior is actually defined). – chtz May 28 '20 at 08:08
    Furthermore, for convolutions of that size, you should consider asymptotically faster algorithms, e.g., something similar to [Schönhage-Strassen](https://en.wikipedia.org/wiki/Sch%C3%B6nhage%E2%80%93Strassen_algorithm) without the carry during additions (it is not exactly what you want, actually I'm not sure if that actually works in your case). – chtz May 28 '20 at 08:13
  • @chtz: I will try to find similar algorithms, thanks for your advice. – anmo May 28 '20 at 15:05
  • @Peter Cordes: I think that will improve the performance; I will try it and get back later. – anmo May 28 '20 at 15:08
  • @EOF: Sorry for the typo. It should be 40, not 400. I have already changed it. – anmo May 28 '20 at 15:08

0 Answers