Do intrinsics on Intel Xeon Phi give better performance than auto-vectorization?

Question

Intel Xeon Phi provides the "IMCI" instruction set.
I used it to compute "c = a*b", like this:

float* x = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT) ;
float* y = (float*) _mm_malloc(N*sizeof(float), ALIGNMENT) ;
float z[N];
_Cilk_for(size_t i = 0; i < N; i+=16)
{
    __m512 x_1Vec = _mm512_load_ps(x+i);
    __m512 y_1Vec = _mm512_load_ps(y+i);

    __m512 ans = _mm512_mul_ps(x_1Vec, y_1Vec);
    _mm512_store_pd(z+i,ans);

}

I then tested its performance. With N = 1048576,
it takes 0.083317 s. I wanted to compare this with auto-vectorization,
so I wrote another version of the code:

_Cilk_for(size_t i = 0; i < N; i++)
    z[i] = x[i] * y[i];

This version takes 0.025475 s (but sometimes 0.002285 s or less; I don't know why).
If I change the _Cilk_for to #pragma omp parallel for, the performance is poor.

So, if this is the result, why do we need intrinsics?
Did I make a mistake anywhere?
Can someone suggest how to optimize the code?

Which compiler are you using? Auto-vectorization isn't performed by the CPU itself AFAIK; it depends on the optimization. — Marco A., May 20 '14 at 11:02
I used Intel's icpc compiler with the -O3 and -vec-report3 options. I'm sure the loop is auto-vectorized, but I want to know: if auto-vectorization is better than intrinsics, why do we need intrinsics? — Marcus Wu, May 20 '14 at 11:04
I'm not an expert in this field, but auto-vectorization is a compiler optimization. That means the compiler will try to find a pattern and apply it if it suits your code. If you know in advance that an intrinsic will suit it, you just use it. They might be equivalent if you get it right, or you might get worse performance if you get it wrong. — Marco A., May 20 '14 at 11:11
Thanks a lot! So if I know the right way to use intrinsics, I should get performance better than or equal to auto-vectorization, right? But in fact, it is the opposite. I'm confused about that. — Marcus Wu, May 20 '14 at 12:40
Why isn't z 64-byte aligned? https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-211C11FD-7076-4926-B4BC-138287C0404F.htm — Marco A., May 20 '14 at 19:10
Because I want to use array notation like z[:]. If I allocate z with _mm_malloc, I can't use z[:] correctly. I have now changed the code to align it to 64 bytes: __declspec(align(64)) float z[N]; — Marcus Wu, May 21 '14 at 12:07
If data isn't properly aligned, you usually get runtime errors though. — Marco A., May 21 '14 at 13:15
Did you mean to write _mm512_store_ps instead of _mm512_store_pd? Are you sure that z was computed and not optimized away by the compiler? Printing the sum of z would be one way to make sure. You can use z[0:N] if z is a pointer. — Arch D. Robison, May 22 '14 at 19:07
OP: I would strongly encourage you to revise your title to use proper English. I can't even determine your original intent to correct it myself. — Jeff Hammond, Mar 29 '15 at 22:38

score 3 · Answer 1 · answered May 23 '14 at 19:57

The measurements don't mean much, because of various mistakes.

The code is storing 16 floats as 8 doubles. The _mm512_store_pd should be _mm512_store_ps.
The code is using _mm512_store_... on an unaligned location with address z+i, which may cause a segmentation fault. Use __declspec(align(64)) to fix this.
The arrays x and y are not initialized. That risks introducing random numbers of denormal values, which might impact performance. (I'm not sure if this is an issue for Intel Xeon Phi).
There's no evidence that z is used, hence the optimizer might remove the calculation. I think it is not the case here, but it's a risk with trivial benchmarks like this. Also, allocating a large array on the stack risks stack overflow.
A single run of the examples is probably a poor benchmark, because the time is probably dominated by fork/join overheads of the _Cilk_for. Assuming 120 Cilk workers (the default for 60 4-way threaded cores), there is only about 1048576/120/16 = ~546 iterations per worker. With a clock rate over 1 GHz, that won't take long. In fact, the work in the loop is so small that most likely some workers never get a chance to steal work. That might account for why the _Cilk_for outruns OpenMP. In OpenMP, all the threads must take part in a fork/join for a parallel region to finish.
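A benchmark that avoids the mistakes listed above could be sketched roughly as follows. This is a portable, single-threaded C++17 sketch, not the original poster's code: the name time_multiply, the input values, and the choice to report the fastest of several runs are all assumptions made for illustration. It uses 64-byte aligned heap allocations instead of a stack array, initializes the inputs, repeats the kernel, and produces a checksum so that z is observably used.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper, not from the original post.
// Times z[i] = x[i] * y[i] over n floats, repeated `reps` times, and returns
// the fastest run in seconds. The sum of z is written to `checksum` so the
// result of the multiplication is observably used.
double time_multiply(std::size_t n, int reps, double& checksum) {
    // std::aligned_alloc expects a size that is a multiple of the alignment.
    const std::size_t bytes = (n * sizeof(float) + 63) / 64 * 64;
    float* x = static_cast<float*>(std::aligned_alloc(64, bytes));
    float* y = static_cast<float*>(std::aligned_alloc(64, bytes));
    float* z = static_cast<float*>(std::aligned_alloc(64, bytes));
    if (!x || !y || !z) std::abort();

    // Initialize the inputs with ordinary (non-denormal) values.
    for (std::size_t i = 0; i < n; ++i) {
        x[i] = 1.0f + static_cast<float>(i % 7);
        y[i] = 2.0f;
    }

    double best = 1e300;
    checksum = 0.0;
    for (int r = 0; r < reps; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        for (std::size_t i = 0; i < n; ++i)
            z[i] = x[i] * y[i];
        const auto t1 = std::chrono::steady_clock::now();
        const double s = std::chrono::duration<double>(t1 - t0).count();
        if (s < best) best = s;

        // Read z back outside the timed region on every repetition.
        double sum = 0.0;
        for (std::size_t i = 0; i < n; ++i) sum += z[i];
        checksum = sum;
    }

    std::free(x);
    std::free(y);
    std::free(z);
    return best;
}
```

Calling time_multiply(1048576, 20, sum) and printing both the returned time and sum would give a steadier number than a single run. Adding threads (_Cilk_for or OpenMP) on top of this brings back the fork/join overhead discussed above, so the repetition count matters even more there.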

If the test were written to correct all the mistakes, it would essentially be computing z[:] = x[:]*y[:] on a large array. Because of the wide vector units on Intel(R) Xeon Phi(TM), this becomes a test of memory/cache bandwidth, not ALU speed, since the ALU is quite capable of outrunning memory bandwidth.
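To put numbers on the bandwidth argument, a small helper can convert a timing into effective memory traffic. This is a sketch under a stated assumption: each element moves 12 bytes (two 4-byte loads and one 4-byte store), ignoring extra traffic such as reads caused by write-allocate. The name effective_gb_per_s is hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: effective memory traffic in GB/s for
// z[i] = x[i] * y[i] over n floats.
// Assumption: 3 floats (12 bytes) move per element; any additional traffic
// (for example write-allocate reads) is ignored.
double effective_gb_per_s(std::size_t n, double seconds) {
    const double bytes = 3.0 * sizeof(float) * static_cast<double>(n);
    return bytes / seconds / 1e9;
}
```

For N = 1048576, the timings quoted in the question work out to roughly 0.15 GB/s (0.083317 s), 0.49 GB/s (0.025475 s) and 5.5 GB/s (0.002285 s). How close that is to the hardware limit depends on the machine, but the roughly 36x spread between runs of the same kernel suggests that something other than the multiplication itself is being measured.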

Intrinsics are useful for things that can't be expressed as parallel/simd loops, typically stuff needing fancy permutations. For example, I've used intrinsics to do a 16-element prefix-sum operation on MIC (only 6 instructions if I remember correctly).
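As an illustration of the kind of lane-permuting pattern meant here, the following scalar C++ sketch emulates a log-step (shift-and-add) inclusive prefix sum over 16 lanes. It is not the 6-instruction MIC sequence mentioned above, which is not shown in this thread; the name prefix_sum16 is made up, and each pass of the outer loop stands in for one "shift lanes, then add" vector step.

```cpp
#include <array>
#include <cstddef>

// Inclusive prefix sum over 16 lanes using the log-step pattern.
// Each outer pass emulates: shift the vector up by d lanes (filling with
// zeros), then add it to the original. Four passes (d = 1, 2, 4, 8) suffice.
// Scalar emulation for illustration only.
std::array<float, 16> prefix_sum16(std::array<float, 16> v) {
    for (std::size_t d = 1; d < 16; d *= 2) {
        std::array<float, 16> shifted{};  // lanes below d stay 0
        for (std::size_t i = d; i < 16; ++i)
            shifted[i] = v[i - d];
        for (std::size_t i = 0; i < 16; ++i)
            v[i] += shifted[i];
    }
    return v;
}
```

The lane shift is the part a compiler is unlikely to derive from a plain scalar loop, since the scalar formulation carries a dependence from each element to the next.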

Those are all good points that the OP should consider (+1). But regarding intrinsics for prefix-sum, I have done this http://stackoverflow.com/questions/19494114/parallel-prefix-cumulative-sum-with-sse/19519287#19519287 and ultimately it's not so different from the dot product in this example: it's memory/cache bound, not compute bound. So SIMD (with intrinsics) does not help much for large arrays. — Z boson, May 24 '14 at 06:19
On big-core machines, yes, the prefix-sum is memory bound and essentially pointless to vectorize. Intel Xeon Phi, however, has slower hardware threads with wider vectors, so there the 6-instruction prefix-sum can pay off, at almost 2x the speed of the scalar version even for arrays that do not fit in cache. — Arch D. Robison, May 27 '14 at 14:47
That's interesting! I hope I get a chance to work with the Xeon Phi at some point. I wonder if this will apply to AVX512 cores when they come out after Broadwell. — Z boson, May 27 '14 at 16:44

score 0 · Answer 2 · answered May 22 '14 at 14:23

My answer below equally applies to Intel Xeon and Intel Xeon Phi.

An intrinsics-based solution is the most "powerful" option, just as assembly coding is.
- On the negative side, an intrinsics-based solution is usually not portable, is not a productivity-oriented approach, and is often not applicable to established "legacy" codebases.
- It also often requires the programmer to be a low-level, and even micro-architecture, expert.
However, there are alternatives to intrinsics/assembly coding:
- A) Auto-vectorization (the compiler recognizes certain patterns and automatically generates vector code).
- B) "Explicit" or user-guided vectorization (the programmer gives the compiler guidance on what to vectorize and under which conditions; explicit vectorization usually implies using keywords or pragmas).
- C) Vector classes, other intrinsics wrapper libraries, or even very specialized compilers. (In fact, option C is often as bad as intrinsics coding in terms of productivity and incremental updates to legacy code.)

In your second code snippet you seem to use "explicit" vectorization, which is currently achievable with the Cilk Plus and OpenMP 4.0 "frameworks" supported by all recent versions of the Intel compiler and also by GCC 4.9. (I say "seem to" because _Cilk_for was originally invented for multi-threading; however, the most recent versions of the Intel compiler might automatically parallelize and vectorize the loop when _Cilk_for is used.)
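For comparison, explicit vectorization of the same loop with the OpenMP 4.0 simd directive might look like the sketch below. The function name multiply_simd is illustrative. The pragma only takes effect when the compiler's OpenMP SIMD support is switched on (for GCC, to my knowledge, via -fopenmp or -fopenmp-simd); otherwise it is ignored and a plain loop remains.

```cpp
#include <cstddef>

// Explicit (user-guided) vectorization with the OpenMP 4.0 `simd` directive.
// The pragma tells the compiler that the iterations are independent and may
// be executed in vector lanes. It requests vectorization only, not threading.
// `multiply_simd` is an illustrative name, not from the original post.
void multiply_simd(const float* x, const float* y, float* z, std::size_t n) {
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i)
        z[i] = x[i] * y[i];
}
```

Unlike the intrinsics version in the question, this source does not fix the vector width, so the same code can be compiled for 16-wide vectors on Xeon Phi or narrower vectors elsewhere.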
