
I have an array A of m matrices, each of size l × 4, where m is around 100,000 and l = 100. The size of list is always n, with n ≤ m. I want to sum the matrices at the given list of indexes. I have written this function, and I have to call it a very large number of times.

void MatrixAddition(int l, int n, vector<int>& list, int ***A, int ***C, int cluster)
{
    for (int i = 0; i < l; i++)
    {
        for (int j = 0; j < 4; j++)
            C[cluster][i][j] = 0;
    }

    for (int i = 0; i < l; i++)
    {
        for (int j = 0; j < n; ++j)
        {
            for (int k = 0; k < 4; k++)
                C[cluster][i][k] += A[list[j]][i][k];
        }
    }
}

I used gprof to measure how much time each function takes in the whole program, and I found that 60% of the time is spent in MatrixAddition. Is there an alternative way to write this function so that my run time is reduced?

  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
52.00      7.85      7.85       20   392.60   405.49  MatrixAddition(int, int, std::vector<int>&, int***, int***, int)

ankit agrawal
  • The triple-levels of indirection alone is probably killing this function. If you were trying *not* to be cache-friendly, congratulations, I think you succeeded. – WhozCraig May 14 '16 at 22:44
  • You need to show us how you allocated those arrays. If you did it naively, then as @WhozCraig stated, this is not good. If you allocated one giant pool of memory and pointed the pointers in the right spots in that memory, then that's another story. – PaulMcKenzie May 14 '16 at 23:19
  • It is a bad idea to call a vector "list". It is like calling a list "vector", calling a set "map", or calling your cat "Dog". – user31264 May 14 '16 at 23:32
  • @user3704712 [Here is a small example](http://ideone.com/YCYVAT) of allocating the entire pool of memory (the last call to `new[]`), as one contiguous block, and then pointing the relevant pointers within that pool of memory. Note only 3 calls to `new[]` are required (as well as 3 calls to `delete[]` to deallocate the memory). If instead you used a naive triple-nested loop to set up the 3D arrary where you're calling `new[]` on each iteration, then that is creating non-contiguous chunks of memory, which can slow down the processing. – PaulMcKenzie May 15 '16 at 04:24
  • @PaulMcKenzie: pointers-to-pointers is going to defeat SIMD vectorization. Even if the pointers are pointing in a nice pattern (into one giant block), the compiler can't assume that at compile time, and the CPU can't assume that at run-time. So instead of looping over a big chunk of memory incrementing a pointer, it has to do a lot of pointer chasing. And I mean a *lot* of pointer chasing, because every block of 4 `int`s is pointed to be a separate pointer. WhozCraig is right; the triple indirection is a killer even with good locality. – Peter Cordes May 15 '16 at 07:56
  • Allocation of A @PaulMcKenzie ` int ***A;` `A=new int**[n];` `for(int i=0;i – ankit agrawal May 15 '16 at 08:13
  • @user3704712 That way of allocating is exactly what you should not be doing. It is the naive approach I was speaking of. – PaulMcKenzie May 15 '16 at 12:31
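To make the contiguous-pool idea from these comments concrete, here is a small hedged sketch (the names `alloc3d`/`free3d` are illustrative, not from the linked ideone example): the `int***` view is built over one contiguous block of ints, using only three `new[]` calls in total.

```cpp
#include <cassert>

// Allocate an m x l x cols "3D array" with only three new[] calls, so all
// the int data itself lives in one contiguous, zero-initialized block.
int*** alloc3d(int m, int l, int cols) {
    int ***a   = new int**[m];                    // plane pointers
    int **rows = new int*[(long)m * l];           // row pointers
    int *pool  = new int[(long)m * l * cols]();   // the contiguous data pool
    for (int i = 0; i < m; ++i) {
        a[i] = rows + (long)i * l;
        for (int j = 0; j < l; ++j)
            a[i][j] = pool + ((long)i * l + j) * cols;
    }
    return a;
}

// Matching three delete[] calls.
void free3d(int*** a) {
    delete[] a[0][0];  // the int pool
    delete[] a[0];     // the row pointers
    delete[] a;        // the plane pointers
}
```

This keeps the `A[i][j][k]` syntax the question uses, but as the later comments point out, the triple indirection itself still defeats vectorization even when the data is contiguous.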

2 Answers


Exchange the loop over i and the loop over j in the second part. This will make the function more cache-friendly.

for(int j=0;j<n;++j)
{
    for (int i = 0; i < l; i++)
    {
        for(int k=0;k<4;k++)
            C[cluster][i][k]+=A[list[j]][i][k];
    }
}

Also, I hope you did not forget the -O3 flag.

user31264
  • There is so much more optimization-defeating stuff than just that. Besides memory layout, hoisting the `const int *tmp = A[list[j]];` out of the inner loop is a major one, because without `__restrict__` qualifiers, the compiler has to assume that stores into `C` could affect `list` entries. (They're both of type `int`, so they can alias.) – Peter Cordes May 15 '16 at 07:53
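As a rough sketch of what that comment suggests, applied to the asker's original `int***` version (the name `MatrixAddition_hoisted` is just for this illustration; this still doesn't fix the memory layout, but it removes the per-element reloads of `C[cluster]`, `list[j]`, and `A[list[j]]`):

```cpp
#include <cassert>
#include <vector>

// Same algorithm as the question, with the loops interchanged and the
// pointer lookups hoisted out of the inner loops.
void MatrixAddition_hoisted(int l, int n, const std::vector<int>& list,
                            int ***A, int ***C, int cluster)
{
    int **Ccl = C[cluster];              // hoisted out of both loop nests
    for (int i = 0; i < l; i++)
        for (int k = 0; k < 4; k++)
            Ccl[i][k] = 0;

    for (int j = 0; j < n; ++j) {
        int **Aj = A[list[j]];           // loaded once per j, not per (i,k)
        for (int i = 0; i < l; i++)
            for (int k = 0; k < 4; k++)
                Ccl[i][k] += Aj[i][k];
    }
}
```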

(update: an earlier version had the indexing wrong. This version auto-vectorizes fairly easily).

Use a C multidimensional array (rather than an array of pointers to pointers), or a flat 1D array indexed with i*cols + j, so the memory is contiguous. This will make a huge difference to the effectiveness of hardware prefetching to make good use of memory bandwidth. Loads with an address coming from another load really suck for performance, or conversely, having predictable addresses known well ahead of time helps a lot because the loads can start well before they're needed (thanks to out-of-order execution).
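A minimal sketch of that flat-1D indexing (the helper name `flat_index` is only for illustration): with one contiguous block of `m*l*4` ints, the element at (matrix, row, col) lives at `(matrix*l + row)*4 + col`, so walking a single matrix touches memory strictly in order.

```cpp
#include <cassert>

// Map a (matrix, row, col) triple into a flat m*l*4 block of ints.
// Walking row and col in the inner loops gives a fully contiguous scan.
inline long flat_index(long matrix, long row, long col, long l) {
    return (matrix * l + row) * 4 + col;
}
```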

Also, @user31264's answer is correct, you need to interchange the loops so the loop over j is the outer-most. This is good but nowhere near sufficient on its own.

This will also allow the compiler to auto-vectorize. Actually, I had a surprisingly hard time getting gcc to auto-vectorize well. (But that's probably because I got the indexing wrong, because I only looked at the code the first time. So the compiler didn't know we were looping over contiguous memory.)


I played around with it on the Godbolt compiler explorer.

I finally got good nice compiler output from this version, which takes A and C as flat 1D arrays and does the indexing itself:

void MatrixAddition_contiguous(int rows, int n, const  vector<int>& list,
                               const int *__restrict__ A, int *__restrict__ C, int cluster)
  // still auto-vectorizes without __restrict__, but checks for overlap and
  // runs a scalar loop in that case
{
  const int cols = 4;  // or global constexpr or something
  int *__restrict__ Ccluster = C + ((long)cluster) * rows * cols;

  for(int i=0;i<rows;i++)
    //#pragma omp simd  
    for(int k=0;k<4;k++)
      Ccluster[cols*i + k]=0;

  for(int j=0;j < n;++j) { // loop over the n selected matrices in the outer-most loop
    const int *__restrict__ Alistj = A + ((long)list[j]) * rows * cols;
    // #pragma omp simd    // Doesn't work: only auto-vectorizes with -O3
    // probably only -O3 lets gcc see through the k=0..3 loop and treat it like one big loop
    for (int i = 0; i < rows; i++) {
      long row_offset = cols*i;
      //#pragma omp simd  // forces vectorization with 16B vectors, so it hurts AVX2
      for(int k=0;k<4;k++)
        Ccluster[row_offset + k] += Alistj[row_offset + k];
    }
  }
}

Manually hoisting the list[j] definitely helped the compiler realize that stores into C can't affect the indices that will be loaded from list[j]. Manually hoisting the other stuff probably wasn't necessary.

Hoisting A[list[j]], rather than just list[j], is an artifact of a previous approach where I had the indexing wrong. As long as we hoist the load from list[j] as far as possible, the compiler can do a good job even if it doesn't know that list doesn't overlap C.

The inner loop, with gcc 5.3 targeting x86-64 -O3 -Wall -march=haswell -fopenmp (and -fverbose-asm) is:

.L26:
    vmovdqu ymm0, YMMWORD PTR [r8+rax]        # MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B], MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B]
    vpaddd  ymm0, ymm0, YMMWORD PTR [rdx+rax]   # vect__71.75, MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B], MEM[base: vectp.73_90, index: ivtmp.91_26, offset: 0B]
    add     r12d, 1   # ivtmp.88,
    vmovdqu YMMWORD PTR [r8+rax], ymm0        # MEM[base: Ccluster_20, index: ivtmp.91_26, offset: 0B], vect__71.75
    add     rax, 32   # ivtmp.91,
    cmp     r12d, r9d # ivtmp.88, bnd.66
    jb      .L26        #,

So it's doing eight adds at once, with AVX2 vpaddd, with unaligned loads and an unaligned store back into C.

Since this is auto-vectorizing, it should make good code with ARM NEON, or PPC Altivec, or anything that can do packed 32bit addition.

I couldn't get gcc to tell me anything with -ftree-vectorizer-verbose=2, but clang's -Rpass-analysis=loop-vectorize was slightly helpful.

Peter Cordes