I'm trying to get my code to auto vectorize, but it isn't working.

int _tmain(int argc, _TCHAR* argv[])
{
    const int N = 4096;
    float x[N];
    float y[N];
    float sum = 0;

    //create random values for x and y 
    for (int i = 0; i < N; i++)
    {
        x[i] = rand() >> 1;
        y[i] = rand() >> 1;
    }

    for (int i = 0; i < N; i++){
        sum += x[i] * y[i];
    }
}

Neither loop vectorizes here, but I'm really only interested in the second loop.

I'm using Visual Studio Express 2013 and am compiling with the /O2 and /Qvec-report:2 options (the latter reports whether or not each loop was vectorized). When I compile, I get the following message:

--- Analyzing function: main
c:\users\...\documents\visual studio 2013\projects\intrin3\intrin3\intrin3.cpp(28) : info C5002: loop not vectorized due to reason '1200'
c:\users\...\documents\visual studio 2013\projects\intrin3\intrin3\intrin3.cpp(41) : info C5002: loop not vectorized due to reason '1305'

Reason '1305', as can be seen HERE, says that "the compiler can't discern proper vectorizable type information for this loop." I'm not really sure what this means. Any ideas?

After splitting the second loop into two loops:

for (int i = 0; i < N; i++){
    sumarray[i] = x[i] * y[i];
}

for (int i = 0; i < N; i++){
    sum += sumarray[i];
}

Now the first of the above loops vectorizes, but the second one does not, again with error code 1305.

Jon B. Jones
  • What SIMD host are you compiling for? It might be that the host doesn't provide the required instructions (I suspect you'd need a horizontal add). Have you tried breaking the loop into two: one to produce a `sum[]` vector, and a second loop which then adds the elements in `sum[]`? That may at least narrow down what's happening. – Jens Apr 30 '14 at 03:50
  • I'm not sure what you mean by SIMD host. My CPU is an Intel Core i7. Also, really sorry, but I mistakenly posted that the first loop does get vectorized, which is NOT true. I've updated the post and the output message. Thanks for your suggestion, going to try breaking the loop up now. Is it necessary to make it a vector or can I use a sum array? – Jon B. Jones Apr 30 '14 at 06:09
  • Broke the loop up and added the relevant information to the question. – Jon B. Jones Apr 30 '14 at 07:18
  • Try adding "/arch:AVX" to command line. It should enable generation of additional SIMD instructions. – Ville Krumlinde Apr 30 '14 at 07:48
  • @JonB.Jones: With SIMD host I mean, what CPU are you producing code for, what SIMD instructions are available? Sometimes a loop *can* be vectorized but then code *can not* be generated because the CPU does not provide those SIMD instructions required for vectorizing the loop. – Jens Apr 30 '14 at 11:04

2 Answers

Error 1305 happens because the value of sum is never used, so the optimizer does not vectorize the loop. Simply adding printf("%f\n", sum) fixes that. But then you get a new error code, 1105: "Loop includes a non-recognized reduction operation". To fix this, you need to set /fp:fast.

The reason is that floating-point arithmetic is not associative, and reductions using SIMD or MIMD (i.e. using multiple threads) reorder the additions, which is only valid when the operation is associative. A looser floating-point model permits that reordering, so the compiler can vectorize the reduction.
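To make the associativity point concrete, here is a minimal sketch. The function names `add_left_first` and `add_right_first` are made up for illustration, and the behaviour described assumes ordinary IEEE single-precision evaluation without a fast floating-point model.

```cpp
// Adds three floats grouped as (a + b) + c.
float add_left_first(float a, float b, float c) {
    return (a + b) + c;
}

// Adds the same three floats grouped as a + (b + c).
float add_right_first(float a, float b, float c) {
    return a + (b + c);
}
```

With a = 1e20f, b = -1e20f and c = 1.0f, the first grouping yields 1 while the second yields 0, because adding 1 to -1e20f is lost to rounding. A compiler that must preserve the source order cannot freely regroup such additions.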

I just tested it with the following code: with the default /fp:precise the loop does not vectorize, and with /fp:fast it does.

#include <stdio.h>
int main() {
    const int N = 4096;
    float x[N];
    float y[N];
    float sum = 0;
    for (int i = 0; i < N; i++){
        sum += x[i] * y[i];
    }
    printf("sum %f\n", sum);
}
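As a rough picture of why the reduction needs reordering, the sketch below emulates a four-lane reduction in scalar code. It is only an illustration of the idea, not what the compiler actually emits; the names `dot_sequential` and `dot_four_lanes` and the lane count of four are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>

// Sequential dot product: every addition happens in source order.
float dot_sequential(const float* x, const float* y, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}

// Scalar emulation of a 4-lane reduction: four independent partial sums
// that are only combined after the loop.
float dot_four_lanes(const float* x, const float* y, std::size_t n) {
    assert(n % 4 == 0);  // sketch only: no remainder loop
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    for (std::size_t i = 0; i < n; i += 4) {
        for (std::size_t k = 0; k < 4; ++k) {
            lane[k] += x[i + k] * y[i + k];
        }
    }
    return (lane[0] + lane[1]) + (lane[2] + lane[3]);
}
```

Both functions add the same products, but in a different order, so their results can differ in floating point. That difference is what the stricter floating-point model refuses to introduce on its own.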

As for the loop that calls rand(): rand() is not a SIMD function, so that loop can't be vectorized. You would need a SIMD rand() function, and I don't know of one. An alternative is to precompute an array of random numbers and use the array instead. In any case, rand() is a horrible random number generator and is only useful for some toy cases. Consider using the Mersenne Twister PRNG.
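One way to follow the precompute suggestion is sketched below using the C++11 `<random>` header, assuming your compiler provides it. The helper names `make_random_floats` and `dot_product`, the seed parameter, and the 0 to 1 value range are choices made for this example only.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Fills a vector with n pseudo-random floats between 0 and 1,
// generated by the Mersenne Twister engine std::mt19937.
std::vector<float> make_random_floats(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<float> dist(0.0f, 1.0f);
    std::vector<float> values(n);
    for (float& v : values) {
        v = dist(gen);
    }
    return values;
}

// The reduction loop is kept free of any generator calls, so it stays
// a candidate for vectorization. Assumes x and y have the same size.
float dot_product(const std::vector<float>& x, const std::vector<float>& y) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += x[i] * y[i];
    }
    return sum;
}
```

The random numbers are produced once, up front, in a loop that is not expected to vectorize, while the loop that matters only reads from the arrays.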

Z boson
  • So, just to make sure.. compile like this: cl filename.cpp /O2 /fp:fast /Qvec-report:2 ? Because that still gives me the same error code. – Jon B. Jones Apr 30 '14 at 08:38
  • @JonB.Jones, you can't vectorize your function with the `rand()` function but if you use `fp:fast` you can vectorize the reduction. I tested it myself. – Z boson Apr 30 '14 at 08:46
  • @JonB.Jones, you get error code 1305 "Not enough type information". I get error code 1105 "Loop includes a non-recognized reduction operation". I'm not sure why we get different error codes. http://blogs.msdn.com/b/nativeconcurrency/archive/2012/05/22/auto-vectorizer-in-visual-studio-11-did-it-work.aspx – Z boson Apr 30 '14 at 08:50
  • @JonB.Jones, without `printf` I get your error code. When I print the value I get no error code and it vectorized. Apparently, the optimizer does not vectorize the loop if you don't use the value. Try adding `printf("%f\n", sum)` and `fp:fast`. – Z boson Apr 30 '14 at 09:04
  • That's fine, I'm not really worried about the loop with the rand() function. As for the other one, adding the printf() function actually did the trick! Thank you so much! – Jon B. Jones Apr 30 '14 at 09:18
  • @Zboson: Thanks for the explanation, I wasn't aware of the associative behavior :) – Jens Apr 30 '14 at 11:07

One problem could be that your compiler doesn't necessarily align your stack allocations. If your compiler supports C++11, you could use:

alignas(16) float x[N];
alignas(16) float y[N];

This explicitly requests 16-byte-aligned memory, which most SSE memory operations require.


EDIT:

Even if alignment isn't the issue and your compiler is vectorizing unaligned code, you should still make this optimization, as unaligned SSE operations can be much slower than their aligned counterparts, particularly on older CPUs.
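If you want to confirm the alignment you actually get, a small check like the following may help. It is a sketch that assumes a compiler with C++11 `alignas` support; `is_aligned` and `stack_arrays_are_aligned` are hypothetical helper names. On an MSVC release without `alignas`, `__declspec(align(16))` is the vendor-specific spelling to try instead.

```cpp
#include <cstddef>
#include <cstdint>

// Returns true when p sits on a multiple of `alignment` bytes.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Declares two 16-byte aligned stack arrays and reports whether both
// really start on a 16-byte boundary.
bool stack_arrays_are_aligned() {
    const int N = 4096;
    alignas(16) float x[N];
    alignas(16) float y[N];
    x[0] = 0.0f;  // touch the arrays so they are not entirely unused
    y[0] = 0.0f;
    return is_aligned(x, 16) && is_aligned(y, 16);
}
```

Checking the address is cheap and removes the guesswork about whether the alignment request was honoured.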

RamblingMad