
I have been struggling to vectorize a particular application for some time now, and I have tried everything from auto-vectorization to hand-coded SSE intrinsics. Somehow I am still unable to obtain a speedup on my stencil-based application.

The following is a snippet of my current code, which I have vectorized using SSE intrinsics. When I compile it (Intel icc) with -vec-report3, I constantly get this message:
remark: loop was not vectorized: statement cannot be vectorized.

  #pragma ivdep
  for ( i = STENCIL; i < z - STENCIL; i+=4 )
  {
    it = it2 + i;

    __m128 tmp2i = _mm_mul_ps(_mm_add_ps(_mm_load_ps(&p2[i+j*it_j-it_j4+k*it_k]),_mm_load_ps(&p2[i+j*it_j+it_j4+k*it_k])),X4_i); //loop was not vectorized: statement cannot be vectorized
    __m128 tmp3 = _mm_mul_ps(_mm_add_ps(_mm_load_ps(&p2[i+j*it_j-it_j3+k*it_k]),_mm_load_ps(&p2[i+j*it_j+it_j3+k*it_k])),X3_i);
    __m128 tmp4 = _mm_mul_ps(_mm_add_ps(_mm_load_ps(&p2[i+j*it_j-it_j2+k*it_k]),_mm_load_ps(&p2[i+j*it_j+it_j2+k*it_k])),X2_i);
    __m128 tmp5 = _mm_mul_ps(_mm_add_ps(_mm_load_ps(&p2[i+j*it_j-it_j +k*it_k]),_mm_load_ps(&p2[i+j*it_j+it_j +k*it_k])),X1_i);

    __m128 tmp6 = _mm_add_ps(_mm_add_ps(_mm_add_ps(tmp2i,tmp3),_mm_add_ps(tmp4,tmp5)), _mm_mul_ps(_mm_load_ps(&p2[it]),C00_i));

    _mm_store_ps(&tmp2[i],tmp6);

   }

Am I missing something crucial? Since the message doesn't elaborate on why the statement cannot be vectorized, I am finding it difficult to identify the bottleneck.

UPDATE: After careful consideration of the suggestions, I tweaked the code as follows. I thought it best to break it down further, to identify the statements that are actually responsible for the vector dependence.

//#pragma ivdep
  for ( i = STENCIL; i < z - STENCIL; i+=4 )
  {
    it = it2 + i;
    __m128 center = _mm_mul_ps(_mm_load_ps(&p2[it]),C00_i);

    u_j4 = _mm_load_ps(&p2[i+j*it_j-it_j4+k*it_k]); //Line 180
    u_j3 = _mm_load_ps(&p2[i+j*it_j-it_j3+k*it_k]);
    u_j2 = _mm_load_ps(&p2[i+j*it_j-it_j2+k*it_k]);
    u_j1 = _mm_load_ps(&p2[i+j*it_j-it_j +k*it_k]);
    u_j8 = _mm_load_ps(&p2[i+j*it_j+it_j4+k*it_k]);
    u_j7 = _mm_load_ps(&p2[i+j*it_j+it_j3+k*it_k]);
    u_j6 = _mm_load_ps(&p2[i+j*it_j+it_j2+k*it_k]);
    u_j5 = _mm_load_ps(&p2[i+j*it_j+it_j +k*it_k]);

    __m128 tmp2i = _mm_mul_ps(_mm_add_ps(u_j4,u_j8),X4_i);
    __m128 tmp3 = _mm_mul_ps(_mm_add_ps(u_j3,u_j7),X3_i);
    __m128 tmp4 = _mm_mul_ps(_mm_add_ps(u_j2,u_j6),X2_i);
    __m128 tmp5 = _mm_mul_ps(_mm_add_ps(u_j1,u_j5),X1_i);

    __m128 tmp6 = _mm_add_ps(_mm_add_ps(tmp2i,tmp3),_mm_add_ps(tmp4,tmp5));
    __m128 tmp7 = _mm_add_ps(tmp6,center);

    _mm_store_ps(&tmp2[i],tmp7);  //Line 196

   }

When I compile (icc) the above code without #pragma ivdep I get the following message:

remark: loop was not vectorized: existence of vector dependence.
vector dependence: assumed FLOW dependence between tmp2 line 196 and tmp2 line 196.
vector dependence: assumed ANTI dependence between tmp2 line 196 and tmp2 line 196.

When I compile (icc) it with the #pragma ivdep, I get the following message:

remark: loop was not vectorized: unsupported data type. //Line 180

Why is there a dependence suggested for Line 196? How can I eliminate the suggested vector dependence?

Quuxplusone
PGOnTheGo
  • Simplify the `for` construct by precomputing the end value and the number of loops. – David Schwartz Jul 16 '12 at 19:38
  • It can't vectorize it because you already vectorized it. You're not getting any speedup because your computation/memory-access ratio is too low. – Mysticial Jul 16 '12 at 19:57
  • It's not the alignment which was my first thought (Mysticial corrected me), but it's definitely worth starting by simplifying the expressions for the array offsets. – Viktor Latypov Jul 16 '12 at 20:10
  • @Mysticial: This is only a fragment of my code. The actual code is a 3D stencil and I have tested on data sizes as large as 512 to 1024, which would mean the total number of floating-point operations is (512)^3 * 31 * 200, where 200 is the number of iterations. I had expected some improvement from scheduling 4 floating-point operations together (using SSE intrinsics) compared to 1 in a single clock cycle. – PGOnTheGo Jul 16 '12 at 20:13
  • That's not the issue. You have 10 memory accesses for only 13 operations. That's too much. The CPU will likely be bottlenecked by the loads rather than the computation. From my experience, you really need to have at least a 3-to-1 ratio of computation/memory access. – Mysticial Jul 16 '12 at 20:22
  • Also, your loop body seems to have a fairly long dependency chain. I'd consider manually unrolling it by 2 to 4 iterations. – Mysticial Jul 16 '12 at 20:26
  • @Mysticial: I did try implementing manual loop unrolling by factors of 2, 4, and 8. However, since this loop is the innermost (the fastest dimension in the 3D stencil), unrolling it always degraded performance. But when I unrolled the outermost loop I did see some improvement. – PGOnTheGo Jul 16 '12 at 20:40
  • @Mysticial: Could you suggest ways I could reduce the number of memory accesses in my code? I am implementing shuffle operations in my ith dimension to reduce the number of loads, but I can't figure out how that would be possible for the jth and kth dimensions, since they are not contiguous in memory (unlike dimension i) and have associated strides 'dx' and 'dx*dy' respectively. – PGOnTheGo Jul 16 '12 at 20:44
  • I have to afk. But I'll come back to this later on. – Mysticial Jul 16 '12 at 20:49
  • The only way to reduce the memory access is to change the way you process your data. If you're making multiple passes over the same data, try grouping them together. There's not much more I can say here. You can take a look at: http://en.wikipedia.org/wiki/Loop_tiling – Mysticial Jul 16 '12 at 23:04
  • Try compiling with `-guide`. It could possibly give you advice on what to do to enable vectorisation of the loop. You can also try the `simd` and `vector` pragmas. – Hristo Iliev Jul 19 '12 at 13:23

1 Answer


The problem is that you're trying to use auto-vectorization together with hand-vectorized code. The compiler says that the line can't be vectorized because you can't vectorize a vector function: the intrinsics already operate on `__m128` values.

Either let the compiler auto-vectorize the loop, or disable auto-vectorization and vectorize your code manually. As already noted in the comments, the auto-vectorizer also estimates vectorization profitability: it checks whether vectorizing your code is worthwhile.
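
To illustrate the first option, here is a minimal sketch of what a scalar version of the j-direction stencil might look like when the loop is left to the auto-vectorizer. It is not the code from the question: the names `stencil_j_scalar`, `stencil_j_selfcheck`, `NX`, `HALO` and the coefficient array `x` are hypothetical, and the k-direction offsets are left out for brevity. The C99 `restrict` qualifiers are an assumption about the data: they promise the compiler that the output and input buffers do not overlap, which is the kind of information it needs before it can rule out a dependence on the stored array.

```c
#include <stddef.h>

/* Hypothetical sizes for the sketch: row length and stencil half-width. */
enum { NX = 16, HALO = 4, NROWS = 2 * HALO + 1 };

/* Scalar j-direction stencil over one row of n points.
   row points at the first element of the center row, it_j is the
   distance (in floats) between neighbouring rows, and x[s-1] is the
   weight applied to the pair of rows at distance s.
   restrict states that out and row never alias (an assumption). */
void stencil_j_scalar(float *restrict out, const float *restrict row,
                      ptrdiff_t it_j, size_t n,
                      float c00, const float x[HALO])
{
    for (size_t i = 0; i < n; ++i) {
        float acc = c00 * row[i];
        /* Fixed trip count, so the compiler can fully unroll this loop. */
        for (ptrdiff_t s = 1; s <= HALO; ++s)
            acc += x[s - 1] * (row[(ptrdiff_t)i - s * it_j] +
                               row[(ptrdiff_t)i + s * it_j]);
        out[i] = acc;
    }
}

/* Small self-check on a grid of ones: every output should equal
   c00 + 2 * (x1 + x2 + x3 + x4). Returns 1 on success, 0 otherwise. */
int stencil_j_selfcheck(void)
{
    float grid[NROWS * NX];
    float out[NX];
    const float x[HALO] = {0.5f, 0.25f, 0.125f, 0.0625f};

    for (size_t n = 0; n < (size_t)(NROWS * NX); ++n)
        grid[n] = 1.0f;

    /* The center row sits HALO rows into the grid. */
    stencil_j_scalar(out, grid + HALO * NX, NX, NX, 1.0f, x);

    for (size_t i = 0; i < NX; ++i)
        if (out[i] != 2.875f)
            return 0;
    return 1;
}
```

Whether the compiler actually vectorizes such a loop still depends on its profitability estimate, so it is worth checking the vectorization report again after the change.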

hdante