I'm still trying to get my 1 to 2 cycles per byte (cpb) out of Power8's SHA instructions. This C/C++ code copies the user's message into the message schedule:
void SHA256_SCHEDULE(uint32_t W[64+2], const uint8_t* D)
{
    uint32_t* w = reinterpret_cast<uint32_t*>(W);
    const uint32_t* d = reinterpret_cast<const uint32_t*>(D);
    unsigned int i=0;

    // Byte-reverse each 32-bit word of the 16-byte block
    const uint8x16_p8 mask = {3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12};
    for (i=0; i<16; i+=4, d+=4, w+=4)
        VectorStore32x4u(VectorPermute32x4(VectorLoad32x4u(d, 0), mask), w, 0);
    ...
}
When I compile at -O3 and look at the disassembly I see the following:
100008bc: 99 26 20 7c lxvd2x vs33,0,r4
...
100008d0: 57 0a 21 f0 xxswapd vs33,vs33
100008d8: 2b 08 21 10 vperm v1,v1,v1,v0
I believe what is happening is:
1. load occurs at 100008bc (lxvd2x)
2. le-to-be conversion occurs at 100008d0 (xxswapd)
3. my permutation is applied at 100008d8 (vperm)
At (1) the VSX register has the value, but it is in little-endian format. Elements 0 and 4 need to be swapped, and elements 2 and 3 need to be swapped.
At (2) and (3) two permutations are being applied. It is kind of like calling shuffle_epi32 followed by shuffle_epi8 on an x86 machine.
PowerPC's vec_perm is very powerful, and any two permutations can be folded into one permutation.
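To make the folding claim concrete, here is a minimal scalar sketch in portable C++ of how two byte permutations collapse into one mask. It is only a model: it treats a permute as out[i] = in[mask[i]] on a 16-byte array, uses a single input vector, and ignores how vec_perm numbers elements on a little-endian target. The names Bytes16, Permute, ComposeMasks, kSwapDoublewords and kByteReverseWords are hypothetical and are not part of my code or of any Altivec header. kSwapDoublewords stands in for xxswapd under that model, and kByteReverseWords is the mask from SHA256_SCHEDULE.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using Bytes16 = std::array<uint8_t, 16>;

// Scalar model of a one-input byte permute: out[i] = in[mask[i] & 15].
inline Bytes16 Permute(const Bytes16& in, const Bytes16& mask)
{
    Bytes16 out{};
    for (std::size_t i = 0; i < 16; ++i)
        out[i] = in[mask[i] & 15];
    return out;
}

// Build one mask equivalent to applying `first`, then `second`.
// Since Permute(Permute(x, first), second)[i] == x[first[second[i]]],
// the folded mask is simply `first` permuted by `second`.
inline Bytes16 ComposeMasks(const Bytes16& first, const Bytes16& second)
{
    return Permute(first, second);
}

// Assumed model of xxswapd: exchange the two 8-byte halves.
const Bytes16 kSwapDoublewords = {{8,9,10,11,12,13,14,15, 0,1,2,3,4,5,6,7}};

// The mask from SHA256_SCHEDULE: byte-reverse each 32-bit word.
const Bytes16 kByteReverseWords = {{3,2,1,0, 7,6,5,4, 11,10,9,8, 15,14,13,12}};
```

Under this model, ComposeMasks(kSwapDoublewords, kByteReverseWords) yields {11,10,9,8, 15,14,13,12, 3,2,1,0, 7,6,5,4}, so one permute with that mask gives the same bytes as the swap followed by my permute. Whether the compiler can be persuaded to emit a single vperm with a folded mask is exactly what I am asking.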
My first question is, why are the two permutations not being folded into one?
My second question is, how can I force the compiler to perform the folding?
I'm trying my best to avoid inline assembly because the code supports GCC, Clang and IBM's XL C/C++. IBM's XL C/C++ does not support inline assembly as well as GCC and Clang, so it is going to be a painful path.
Here is the full disassembly:
0000000010000880 <SHA256_SCHEDULE(unsigned int*, unsigned char const*)>:
10000880: 03 10 40 3c lis r2,4099
10000884: 00 81 42 38 addi r2,r2,-32512
10000888: f0 ff c1 fb std r30,-16(r1)
1000088c: f8 ff e1 fb std r31,-8(r1)
10000890: fe ff 22 3d addis r9,r2,-2
10000894: 10 00 c4 3b addi r30,r4,16
10000898: 80 8e 29 39 addi r9,r9,-29056
1000089c: 10 00 e3 3b addi r31,r3,16
100008a0: 20 00 84 39 addi r12,r4,32
100008a4: 20 00 63 39 addi r11,r3,32
100008a8: 99 4e 00 7c lxvd2x vs32,0,r9
100008ac: 30 00 a3 38 addi r5,r3,48
100008b0: 40 00 23 39 addi r9,r3,64
100008b4: c4 ff c0 38 li r6,-60
100008b8: c0 ff e0 38 li r7,-64
100008bc: 99 26 20 7c lxvd2x vs33,0,r4
100008c0: 30 00 84 38 addi r4,r4,48
100008c4: f8 ff 00 39 li r8,-8
100008c8: e4 ff 40 39 li r10,-28
100008cc: 57 02 00 f0 xxswapd vs32,vs32
100008d0: 57 0a 21 f0 xxswapd vs33,vs33
100008d4: 97 05 00 f0 xxlnand vs32,vs32,vs32
100008d8: 2b 08 21 10 vperm v1,v1,v1,v0
100008dc: 57 0a 21 f0 xxswapd vs33,vs33
100008e0: 99 1f 20 7c stxvd2x vs33,0,r3
100008e4: 18 00 60 38 li r3,24
100008e8: a6 03 69 7c mtctr r3
100008ec: 99 f6 20 7c lxvd2x vs33,0,r30
100008f0: 57 0a 21 f0 xxswapd vs33,vs33
100008f4: 2b 08 21 10 vperm v1,v1,v1,v0
100008f8: 57 0a 21 f0 xxswapd vs33,vs33
100008fc: 99 ff 20 7c stxvd2x vs33,0,r31
10000900: 99 66 20 7c lxvd2x vs33,0,r12
10000904: 57 0a 21 f0 xxswapd vs33,vs33
10000908: 2b 08 21 10 vperm v1,v1,v1,v0
1000090c: 57 0a 21 f0 xxswapd vs33,vs33
10000910: 99 5f 20 7c stxvd2x vs33,0,r11
10000914: 99 26 20 7c lxvd2x vs33,0,r4
10000918: 57 0a 21 f0 xxswapd vs33,vs33
1000091c: 2b 08 01 10 vperm v0,v1,v1,v0
10000920: 57 02 00 f0 xxswapd vs32,vs32
10000924: 99 2f 00 7c stxvd2x vs32,0,r5
10000928: 00 00 00 60 nop
1000092c: 00 00 42 60 ori r2,r2,0
10000930: 99 36 09 7c lxvd2x vs32,r9,r6
10000934: 99 3e 89 7d lxvd2x vs44,r9,r7
10000938: 99 56 a9 7d lxvd2x vs45,r9,r10
1000093c: 99 46 29 7c lxvd2x vs33,r9,r8
10000940: 82 06 00 10 vshasigmaw v0,v0,0,0
10000944: 82 7e 21 10 vshasigmaw v1,v1,0,15
10000948: 80 60 00 10 vadduwm v0,v0,v12
1000094c: 80 68 00 10 vadduwm v0,v0,v13
10000950: 80 08 00 10 vadduwm v0,v0,v1
10000954: 99 4f 00 7c stxvd2x vs32,0,r9
10000958: 08 00 29 39 addi r9,r9,8
1000095c: d4 ff 00 42 bdnz 10000930 <SHA256_SCHEDULE(unsigned int*, unsigned char const*)+0xb0>
10000960: f0 ff c1 eb ld r30,-16(r1)
10000964: f8 ff e1 eb ld r31,-8(r1)
10000968: 20 00 80 4e blr
1000096c: 00 00 00 00 .long 0x0
10000970: 00 09 00 00 .long 0x900
10000974: 00 02 00 00 attn
10000978: 00 00 00 60 nop
1000097c: 00 00 42 60 ori r2,r2,0