1

I am trying to sort 8 integers with AVX2 inline assembly in GCC. The code to find the ranks works, assuming no duplicate values. But when I try to use vpermd to sort the array using the ranks it fails. I have isolated the issue in this code:

#include <stdio.h>
#include <x86intrin.h>

int main() {
    __m256i C __attribute__ ((aligned (32))); 
    int array[8] __attribute__ ((aligned (32))) = {40,70,60,20,0,30,50,10};
    int ranks[8] __attribute__ ((aligned (32))) = {4, 7, 6, 2, 0,3, 5, 1};
    __asm__ (
        "VMOVDQA %1,%%ymm1;" // loads ranks
        "VMOVDQA %2,%%ymm2;" // loads array
        "VPERMD  %%ymm1,%%ymm2,%%ymm1;" 
        "VMOVDQA %%ymm1,%0;"
        : "=m" (C)
        : "m" (*(__m256i*)array) ,
          "m" (ranks)
    );

    int* c = (int*)&C;
    for(int i=0; i<8; i++)
        printf("%d ",c[i]); 
        // 0 10 50 60 40 20 30 70 

    return 0;
}

You can review the output of this program on Rextester. The output I am getting is:

 0 10 50 60 40 20 30 70   

As you can see they are not properly sorted. Have I misunderstood what vpermd does or written the wrong code?

Michael Petch
  • 46,082
  • 8
  • 107
  • 198
PrincePolka
  • 196
  • 1
  • 9
  • 6
    The thing that goes into a `vpermd` as the shuffle control is a vector of "from"-indices, not "to"-indices. So you don't tell the numbers where to go (which would obviously have a problem with duplicates), you provide locations from where they will be extracted. – harold Dec 02 '17 at 01:51
  • Thanks for the clarification and edit. Is there an instruction with to indices though? – PrincePolka Dec 02 '17 at 02:02
  • 3
    Also, `int* c = (int*)&C;` violates strict aliasing. Why would you use `__m256i` if you're going to use inline asm? Just use another array, or better use intrinsics for the whole thing. See [print a `__m128i` variable](https://stackoverflow.com/a/46752535/224132) for how to correctly access the elements of a `__m128i` / `__m256i`. – Peter Cordes Dec 02 '17 at 06:07
  • 4
    Also, I'd be very surprised if finding the ranks and then shuffling was faster than the usual `vpminsd` / `vpmaxsd` for building comparators for SIMD sorting networks (and `vshufps` for combining data between two vectors, but if you need tiny sorts within one vector, you can use `vpshufd` mostly, and `vpermq` for or `vperm2i128` for lane crossing). See https://webdocs.cs.ualberta.ca/~amaral/papers/furtak-spaa07.pdf for an SSE 4-element-per-register version, and for SIMD merge-sort given sorted small vectors http://www.vldb.org/pvldb/vol8/p1274-inoue.pdf. – Peter Cordes Dec 02 '17 at 06:18
  • 3
    And BTW, no, x86 has no shuffles that send elements *to* a given destination, except for AVX512 scatter stores. I guess Intel *could* have designed an instruction that had some rules for conflict-resolution, and left unwritten elements zeroed, but they didn't do that. (The use-cases are very rare.) – Peter Cordes Dec 02 '17 at 06:22
  • @PeterCordes thank's for your suggestions. I will not use the x86intrin.h – PrincePolka Dec 02 '17 at 19:33
  • 2
    What I'd really suggest is doing the whole thing with intrinsics. Unless you're an expert at x86 assembly instruction selection and micro-optimization, let the compiler choose the instructions while you use intrinsics. For example, did you know why you should never use `vpextrd eax, xmm0, 0`, and what instruction does the same thing but faster? Do you know everything else it says in [Agner Fog's microarch pdf?](http://agner.org/optimize/)? If you have read and understood the whole thing, then by all means write your own asm. You also have to be good at GNU C inline asm constraints... – Peter Cordes Dec 02 '17 at 19:58
  • 1
    Speaking of which, you have a missed optimization already: the first source operand for `vpermd` could have been a memory operand instead of a separate `vmovdqa`. – Peter Cordes Dec 02 '17 at 20:03
  • 1
    I know very little but I do know I can't write better assembly than the compiler and that's not the goal. I am just beginner and this is a learning project. Thanks for the suggestions though I appreciate it – PrincePolka Dec 02 '17 at 20:23
  • 1
    Ok, understood. Messing around is a noble pursuit, but anyone considering it "for real" should read https://gcc.gnu.org/wiki/DontUseInlineAsm carefully. Also, GNU C inline asm is one of the hardest ways to use asm, because you have to get the constraints right to avoid stepping on the compilers toes (or having it step on your toes by assuming your code doesn't modify memory or something, if you just ask for pointers in registers without dummy memory operands.) (See some links in https://stackoverflow.com/tags/inline-assembly/info. – Peter Cordes Dec 03 '17 at 05:27

0 Answers0