Why doesn't the compiler fold xxswapd and vperm?

Question

I'm still trying to get my 1 to 2 cpb out of Power8's SHA instructions. This C/C++ code copies the user's message into the message schedule:

void SHA256_SCHEDULE(uint32_t W[64+2], const uint8_t* D)
{
    uint32_t* w = reinterpret_cast<uint32_t*>(W);
    const uint32_t* d = reinterpret_cast<const uint32_t*>(D);
    unsigned int i=0;

    const uint8x16_p8 mask = {3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12};
    for (i=0; i<16; i+=4, d+=4, w+=4)
        VectorStore32x4u(VectorPermute32x4(VectorLoad32x4u(d, 0), mask), w, 0);

    ...
}
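For reference, here is a scalar sketch of what I understand the permute loop to be computing, assuming the intent is the usual SHA-256 big-endian load of the first 16 message words. The name `sha256_load_be_scalar` is hypothetical and not part of the real code; it is only here to make the byte order explicit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical scalar reference (not from the original code):
// read 64 message bytes as sixteen big-endian 32-bit words.
// Written with shifts so the result does not depend on host endianness.
void sha256_load_be_scalar(uint32_t W[16], const uint8_t* D)
{
    assert(W != nullptr && D != nullptr);
    for (std::size_t i = 0; i < 16; ++i)
    {
        W[i] = (static_cast<uint32_t>(D[4*i + 0]) << 24) |
               (static_cast<uint32_t>(D[4*i + 1]) << 16) |
               (static_cast<uint32_t>(D[4*i + 2]) <<  8) |
               (static_cast<uint32_t>(D[4*i + 3]) <<  0);
    }
}
```

On a little-endian host this amounts to a byte reversal within each 32-bit word, which is what the `{3,2,1,0, 7,6,5,4, ...}` mask expresses.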

When I compile at -O3 and look at the disassembly I see the following:

100008bc:   99 26 20 7c     lxvd2x  vs33,0,r4
...
100008d0:   57 0a 21 f0     xxswapd vs33,vs33
100008d8:   2b 08 21 10     vperm   v1,v1,v1,v0

I believe what is happening is:

1. load occurs at 100008bc (lxvd2x)
2. le-to-be conversion occurs at 100008d0 (xxswapd)
3. my permutation is applied at 100008d8 (vperm)

At (1) the VSX register has the value but it is in little-endian format. Elements 0 and 4 need swapped; and elements 2 and 3 need swapped.

At (2) and (3) two permutations are being applied. It is kind of like calling shuffle_epi32 followed by shuffle_epi8 on an x86 machine.

PowerPC's vec_perm is very powerful, and any two permutations can be folded into one permutation.
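To show what I mean by folding, here is a minimal portable sketch in plain C++ with no AltiVec headers. It models a single-input byte permute as `out[i] = src[mask[i]]` in an abstract element numbering; it does not model the two-input form of `vec_perm` or the hardware's endian-dependent numbering. The names `permute`, `compose`, `kSwapD`, `kBswap32` and `check_fold` are illustrative only.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

using Bytes = std::array<uint8_t, 16>;

// Single-input byte permute: out[i] = src[mask[i] & 15].
Bytes permute(const Bytes& src, const Bytes& mask)
{
    Bytes out{};
    for (std::size_t i = 0; i < 16; ++i)
        out[i] = src[mask[i] & 15];
    return out;
}

// Fold "apply first, then second" into one mask.
// permute(permute(x, first), second)[i] == x[first[second[i]]]
Bytes compose(const Bytes& first, const Bytes& second)
{
    Bytes out{};
    for (std::size_t i = 0; i < 16; ++i)
        out[i] = first[second[i] & 15];
    return out;
}

// Doubleword swap (models xxswapd) and per-word byte reversal (my mask).
const Bytes kSwapD   = {8,9,10,11,12,13,14,15, 0,1,2,3,4,5,6,7};
const Bytes kBswap32 = {3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12};

// Checks that two permutes and one folded permute agree on sample data.
void check_fold()
{
    Bytes x{};
    for (std::size_t i = 0; i < 16; ++i)
        x[i] = static_cast<uint8_t>(0xA0 + i);

    const Bytes two_step = permute(permute(x, kSwapD), kBswap32);
    const Bytes one_step = permute(x, compose(kSwapD, kBswap32));
    assert(two_step == one_step);
}
```

For these two masks the folded table comes out as {11,10,9,8, 15,14,13,12, 3,2,1,0, 7,6,5,4} in this abstract numbering. Whether a compiler performs the same fold on the real instructions is a separate matter.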

My first question is, why are the two permutations not being folded into one?

My second question is, how can I force the compiler to perform the folding?

I'm trying my best to avoid inline assembly because the code supports GCC, Clang and IBM's XL C/C++. IBM's XL C/C++ does not support inline assembly as well as GCC and Clang, so it is going to be a painful path.

Here is the full disassembly:

0000000010000880 <SHA256_SCHEDULE(unsigned int*, unsigned char const*)>:
    10000880:   03 10 40 3c     lis     r2,4099
    10000884:   00 81 42 38     addi    r2,r2,-32512
    10000888:   f0 ff c1 fb     std     r30,-16(r1)
    1000088c:   f8 ff e1 fb     std     r31,-8(r1)
    10000890:   fe ff 22 3d     addis   r9,r2,-2
    10000894:   10 00 c4 3b     addi    r30,r4,16
    10000898:   80 8e 29 39     addi    r9,r9,-29056
    1000089c:   10 00 e3 3b     addi    r31,r3,16
    100008a0:   20 00 84 39     addi    r12,r4,32
    100008a4:   20 00 63 39     addi    r11,r3,32
    100008a8:   99 4e 00 7c     lxvd2x  vs32,0,r9
    100008ac:   30 00 a3 38     addi    r5,r3,48
    100008b0:   40 00 23 39     addi    r9,r3,64
    100008b4:   c4 ff c0 38     li      r6,-60
    100008b8:   c0 ff e0 38     li      r7,-64
    100008bc:   99 26 20 7c     lxvd2x  vs33,0,r4
    100008c0:   30 00 84 38     addi    r4,r4,48
    100008c4:   f8 ff 00 39     li      r8,-8
    100008c8:   e4 ff 40 39     li      r10,-28
    100008cc:   57 02 00 f0     xxswapd vs32,vs32
    100008d0:   57 0a 21 f0     xxswapd vs33,vs33
    100008d4:   97 05 00 f0     xxlnand vs32,vs32,vs32
    100008d8:   2b 08 21 10     vperm   v1,v1,v1,v0
    100008dc:   57 0a 21 f0     xxswapd vs33,vs33
    100008e0:   99 1f 20 7c     stxvd2x vs33,0,r3
    100008e4:   18 00 60 38     li      r3,24
    100008e8:   a6 03 69 7c     mtctr   r3
    100008ec:   99 f6 20 7c     lxvd2x  vs33,0,r30
    100008f0:   57 0a 21 f0     xxswapd vs33,vs33
    100008f4:   2b 08 21 10     vperm   v1,v1,v1,v0
    100008f8:   57 0a 21 f0     xxswapd vs33,vs33
    100008fc:   99 ff 20 7c     stxvd2x vs33,0,r31
    10000900:   99 66 20 7c     lxvd2x  vs33,0,r12
    10000904:   57 0a 21 f0     xxswapd vs33,vs33
    10000908:   2b 08 21 10     vperm   v1,v1,v1,v0
    1000090c:   57 0a 21 f0     xxswapd vs33,vs33
    10000910:   99 5f 20 7c     stxvd2x vs33,0,r11
    10000914:   99 26 20 7c     lxvd2x  vs33,0,r4
    10000918:   57 0a 21 f0     xxswapd vs33,vs33
    1000091c:   2b 08 01 10     vperm   v0,v1,v1,v0
    10000920:   57 02 00 f0     xxswapd vs32,vs32
    10000924:   99 2f 00 7c     stxvd2x vs32,0,r5
    10000928:   00 00 00 60     nop
    1000092c:   00 00 42 60     ori     r2,r2,0
    10000930:   99 36 09 7c     lxvd2x  vs32,r9,r6
    10000934:   99 3e 89 7d     lxvd2x  vs44,r9,r7
    10000938:   99 56 a9 7d     lxvd2x  vs45,r9,r10
    1000093c:   99 46 29 7c     lxvd2x  vs33,r9,r8
    10000940:   82 06 00 10     vshasigmaw v0,v0,0,0
    10000944:   82 7e 21 10     vshasigmaw v1,v1,0,15
    10000948:   80 60 00 10     vadduwm v0,v0,v12
    1000094c:   80 68 00 10     vadduwm v0,v0,v13
    10000950:   80 08 00 10     vadduwm v0,v0,v1
    10000954:   99 4f 00 7c     stxvd2x vs32,0,r9
    10000958:   08 00 29 39     addi    r9,r9,8
    1000095c:   d4 ff 00 42     bdnz    10000930 <SHA256_SCHEDULE(unsigned int*, unsigned char const*)+0xb0>
    10000960:   f0 ff c1 eb     ld      r30,-16(r1)
    10000964:   f8 ff e1 eb     ld      r31,-8(r1)
    10000968:   20 00 80 4e     blr
    1000096c:   00 00 00 00     .long 0x0
    10000970:   00 09 00 00     .long 0x900
    10000974:   00 02 00 00     attn
    10000978:   00 00 00 60     nop
    1000097c:   00 00 42 60     ori     r2,r2,0

*why are the two permutations not being folded into one?* Presumably because your compiler doesn't know how to do that optimization. Missed optimizations are common even in fairly good compilers. On x86, I'd expect clang to do it (because it has a pretty good shuffle optimizer), but I forget if gcc knows how to combine earlier or later `_mm_shuffle_epi32` intrinsics with `_mm_shuffle_epi8` with a constant vector into a single `pshufb` instruction. — Peter Cordes, Mar 07 '18 at 22:47
Thanks @Peter. We reported a GCC bug at [Issue 84753](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=84753), but it was closed immediately as fixed even though we are seeing it under GCC 4.8, GCC 7.2 and IBM XL C/C++ 13.1. There's something fishy going on here. The IBM team knows damn well that bug is present and not fixed. — jww, Mar 07 '18 at 22:59
Your gcc bug should include a test-case that doesn't optimize away. e.g. modify some global arrays, or write a function that takes pointer args. IDK why you included a `main` function at all. I usually include a godbolt link for missed-optimization bugs, but unfortunately Godbolt only has PPC gcc 6.3. I don't have PPC compilers installed locally, so I can't test it easily. — Peter Cordes, Mar 07 '18 at 23:10
Thanks @Peter. There's a link to the real code that is experiencing the issue. The link is in the problem description. The code is my MCVE for the port. There's also a disassembly that shows the problem. — jww, Mar 07 '18 at 23:14
Your gcc bug report doesn't have any complete block I could copy-paste and compile to repro the bug. Thus, it's not a MCVE. It has a tiny snippet which is M but not C or V. It has a `main` which optimizes away and thus is C but not V. It has a link to your original source which is C and V but not M. It should be easy to make one short function which has all 3 properties, **so the gcc devs can easily test it on their own machines with a current build of gcc**. — Peter Cordes, Mar 07 '18 at 23:40
Thanks @Peter. Everything that is needed is in the bug report and the port that is experiencing the problem. I'm pretty sure the compilers are broke and the SHA performance sucks. Even OpenSSL's numbers are off. Everything else is just IBM covering it up, like closing the bug report as "already fixed". — jww, Mar 07 '18 at 23:46
