
In another question on SO we tried (and succeeded) to find a replacement for the missing AVX instruction:

 __m256d _mm256_dp_pd(__m256d m1, __m256d m2, const int mask);

Does anyone know why this instruction is missing? There is a partial answer here.

gleeen.gould
  • What do you mean "missing"? There are literally millions of things AVX doesn't have instructions for, and there's no rationale other than "the instruction was not made part of AVX" – jalf Apr 16 '13 at 09:27
  • My bad, "missing" in the sense that I would like AVX to implement it and because the single precision version already exists (_mm256_dp_ps). I would like to understand why they chose not to implement it (philosophical or technical reasons). But you are right, that might not be the best term. – gleeen.gould Apr 16 '13 at 09:30
  • The dot-product instructions are slow, bozo ISA extensions that accomplish almost nothing except encouraging novice vector programmers to choose bone-headed data layouts. In general, one should avoid horizontal operations whenever possible, and dot products are among the very worst offenders. – Stephen Canon Apr 18 '13 at 16:01
  • I am reading the above comment (from @StephenCanon) in Sep 2015, but I still wanted to comment. The dot product is one of the most useful and common operations in numerical computations. Of course a vector unit must have operations for that. Making such sweeping statements as Mr. Canon does, without supplying any explanation, is utterly annoying. – Erik Alapää Sep 05 '15 at 08:55

1 Answer


The underlying reason for this and various other AVX limitations is that, architecturally, AVX is little more than two SSE execution units side by side. You will notice that virtually no AVX instructions operate horizontally across the boundary between the two 128-bit halves of a vector (which is particularly annoying in the case of vpalignr). In effect you get two 128-bit SSE operations in parallel. That is useful for the majority of instructions, which operate element-wise, but it is not as useful as a proper 256-bit SIMD implementation.
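To make the lane boundary concrete, here is a minimal sketch of a 4-element double-precision dot product. It is not necessarily the replacement worked out in the other question. `dot4_pd` is a hypothetical helper name, and the `target("avx")` attribute assumes GCC or Clang (with other compilers you would drop it and enable AVX through compiler flags). Note that `_mm256_hadd_pd` only sums pairs inside each 128-bit half, so the two halves have to be combined with an explicit extract:

```c
#include <immintrin.h>

/* Hypothetical helper (not an Intel intrinsic): dot product of two
   4-element double vectors. The target attribute is a GCC/Clang
   extension that lets this compile without -mavx; the CPU must
   still support AVX at run time. */
__attribute__((target("avx")))
static double dot4_pd(const double *a, const double *b)
{
    __m256d x  = _mm256_loadu_pd(a);
    __m256d y  = _mm256_loadu_pd(b);
    __m256d xy = _mm256_mul_pd(x, y);      /* p0, p1, p2, p3 */

    /* Horizontal add works per 128-bit lane only:
       result is p0+p1, p0+p1, p2+p3, p2+p3 */
    __m256d sums = _mm256_hadd_pd(xy, xy);

    /* Crossing the lane boundary takes a separate extract and add. */
    __m128d lo = _mm256_castpd256_pd128(sums);    /* lower half */
    __m128d hi = _mm256_extractf128_pd(sums, 1);  /* upper half */
    return _mm_cvtsd_f64(_mm_add_pd(lo, hi));
}
```

Calling `dot4_pd(a, b)` with `a = {1, 2, 3, 4}` and `b = {5, 6, 7, 8}` returns 70. The extract-and-add at the end is exactly the step a single cross-lane instruction would have had to do internally.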

Paul R