
I have the following Intel PCLMULQDQ intrinsic (a carryless multiply):

__m128i a, b;   // Set to some value
__m128i r = _mm_clmulepi64_si128(a, b, 0x10);

The 0x10 tells me the multiply is:

r = a[63:0] * b[127:64]

I need to convert it to NEON (or more correctly, use the Crypto extension):

poly64_t a, b;   // Set to some value
poly128_t r = vmull_p64(...);   // or vmull_high_p64(...)

I think vmull_p64 works on the low 64 bits, while vmull_high_p64 operates on the high 64 bits. I think I need to shift one of the 128-bit values to mimic _mm_clmulepi64_si128(a, b, 0x10). The docs for PMULL, PMULL2 (vector) are not too clear, and I'm not sure what the result will be because I don't understand PMULL2's arrangement specifier. The ARM ACLE 2.0 is not much help either:

poly128_t vmull_p64 (poly64_t, poly64_t);

Performs widening polynomial multiplication on double-words low part. Available on ARMv8 AArch32 and AArch64.

poly128_t vmull_high_p64 (poly64x2_t, poly64x2_t);

Performs widening polynomial multiplication on double-words high part. Available on ARMv8 AArch32 and AArch64.

How do I convert _mm_clmulepi64_si128 to vmull_{high}_p64?


For anyone contemplating the investment in NEON, PMULL and PMULL2: the 64-bit multiplier and polynomial support is worth it. Benchmarks show GCC code for GMAC went from 12.7 cpb and 90 MB/s (C/C++) down to 1.6 cpb and 670 MB/s (NEON and PMULL{2}).

jww

2 Answers


Since you clarified the source of your confusion with a comment:

A full multiply produces a result twice as wide as the inputs. An add can produce at most one carry-out bit, but a mul produces a whole upper half.

A multiply is exactly equivalent to shifts + adds, and those shifts bring bits from one operand up as high as bit 2N - 2 (when the inputs are N bits wide and bits are numbered from 0). See Wikipedia's example of binary multiplication.

In a normal integer multiply (with carry in the add steps) like x86's mul instruction, carry from the partial sums can set the high bit, so the result is exactly twice as wide.

XOR is add-without-carry, so a carryless multiply is the same shift-and-add algorithm, but with XOR instead of add-with-carry. With no carries, the top bit of the full-width result is always zero. Intel even makes this explicit in the Operation section of the x86 insn ref manual for pclmulqdq: DEST[127] ← 0;. That section documents precisely all the shifting and XORing that produces the result.
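
As a concrete illustration (my own reference code, not from either manual), here's the shift-and-XOR algorithm for a 64x64 -> 128 carryless multiply. The highest bit any partial product can reach is 63 + 63 = 126, so bit 127 of the result really is always zero:

#include <stdint.h>

/* Reference 64x64 -> 128 carryless multiply: shift-and-add, with XOR
   standing in for add-with-carry.  For each set bit i of b, XOR in a
   copy of a shifted left by i. */
static void clmul64(uint64_t a, uint64_t b, uint64_t *lo, uint64_t *hi)
{
    *lo = *hi = 0;
    for (int i = 0; i < 64; i++) {
        if ((b >> i) & 1) {
            *lo ^= a << i;
            if (i != 0)                /* a >> 64 would be UB */
                *hi ^= a >> (64 - i);
        }
    }
}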


The PMULL[2] docs seem pretty clear to me. The destination has to be a .8H vector (which means eight 16-bit (Halfword) elements). The sources for PMULL have to be .8B vectors (8 one-Byte elements), while the sources for PMULL2 have to be .16B (16 one-Byte elements, of which only the upper 8 of each source are used).

If this was ARM32 NEON, where the upper half of each 16B vector register is an odd-numbered narrower register, PMULL2 wouldn't be useful for anything.


There's no "Operation" section to describe exactly which bits multiply with which other bits, though. Fortunately, the paper linked in comments nicely summarizes the available instructions for ARMv7, and for ARMv8 in 32-bit and 64-bit mode. The .8B / .8H organization specifiers seem to be bogus, because PMULL does perform a single 64x64 -> 128 carryless mul like SSE's pclmul instruction. The ARMv7 VMULL.P8 NEON insn does do a packed 8x8 -> 16, but the paper makes it clear that PMULL (and ARMv8 AArch32 VMULL.P8) are different.

It's too bad the ARM doc doesn't say any of this; it seems horribly lacking, esp. re the misleading .8B vector organization stuff. That paper shows an example using the expected .1q and .1d (and .2d) organizations, so maybe the assembler doesn't care what you think your data means, as long as it's the right size.


To do a high-with-low multiply, you need to shift one of them.

For example, if you need all four combinations (a0*b0, a1*b0, a0*b1, a1*b1), like you do to build a 128x128 -> 128 multiply out of 64x64 -> 128 multiplies (with Karatsuba), you could do it like this:

pmull   a0b0.8H, a.8B,  b.8B
pmull2  a1b1.8H, a.16B, b.16B
// swap a's top and bottom half, which I assume can be done efficiently somehow
pmull   a1b0.8H, swapped_a.8B,  b.8B
pmull2  a0b1.8H, swapped_a.16B, b.16B

So it looks like ARM's design choice to include lower-lower and upper-upper, but not cross-multiply instructions (or a selector constant like x86 does) doesn't cause much inefficiency. And since ARM instructions can't just tack on extra immediates the way x86's variable-length machine encoding can, that probably wasn't an option.


Another version of the same thing, with a real shuffle instruction and the Karatsuba-style combination afterwards (adapted from Implementing GCM on ARMv8, but still with made-up register names). The paper reuses the same temporary register along the way, but I've named the temporaries the way I might for a C intrinsics version. This makes the operation of the extended-precision multiply pretty clear. The compiler can reuse dead registers for us.

pmull    a0b0.1q, a.1d, b.1d
pmull2   a1b1.1q, a.2d, b.2d
ext.16b  swapped_b, b, b, #8
pmull    a0b1.1q, a.1d, swapped_b.1d
pmull2   a1b0.1q, a.2d, swapped_b.2d
eor.16b  xor_cross_muls, a0b1, a1b0
ext.16b  cross_low,      zero, xor_cross_muls, #8
eor.16b  result_low,     a0b0, cross_low
ext.16b  cross_high,     xor_cross_muls, zero, #8
eor.16b  result_high,    a1b1, cross_high
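
For reference, here's a C intrinsics sketch of the same sequence. This is only a sketch, with my own function and variable names, assuming the AArch64 ACLE intrinsics vextq_p64, vgetq_lane_p64, and vreinterpretq_u8_p128 are available (compile with something like -march=armv8-a+crypto, as in the other answer):

#include <arm_neon.h>

/* 128x128 -> 256 carryless multiply, mirroring the asm above:
   lo gets a0b0 ^ (cross << 64), hi gets a1b1 ^ (cross >> 64). */
static inline void clmul_128(poly64x2_t a, poly64x2_t b,
                             uint8x16_t *lo, uint8x16_t *hi)
{
    uint8x16_t a0b0 = vreinterpretq_u8_p128(
        vmull_p64(vgetq_lane_p64(a, 0), vgetq_lane_p64(b, 0)));
    uint8x16_t a1b1 = vreinterpretq_u8_p128(vmull_high_p64(a, b));

    poly64x2_t swapped_b = vextq_p64(b, b, 1);  /* ext.16b swapped_b, b, b, #8 */
    uint8x16_t a0b1 = vreinterpretq_u8_p128(
        vmull_p64(vgetq_lane_p64(a, 0), vgetq_lane_p64(swapped_b, 0)));
    uint8x16_t a1b0 = vreinterpretq_u8_p128(vmull_high_p64(a, swapped_b));

    uint8x16_t cross = veorq_u8(a0b1, a1b0);
    uint8x16_t zero  = vdupq_n_u8(0);
    *lo = veorq_u8(a0b0, vextq_u8(zero, cross, 8)); /* cross << 64 */
    *hi = veorq_u8(a1b1, vextq_u8(cross, zero, 8)); /* cross >> 64 */
}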
Peter Cordes
  • The confusing thing for me is [PMULL, PMULL2 (vector)](http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0802a/PMULL_advsimd_vector.html) docs state the specifier is *`Ta=8H, Tb=16B`*. It does not make a lot of sense to me for a 64-bit x 64-bit multiply that's supposed to result in a 128-bit product. I'm also not arriving at expected results with sample programs, so I have not been able to gather an understand what's going on. – jww Jul 25 '16 at 02:26
  • @jww: A full multiply normally produces a result twice as wide as the inputs. An add can produce at most one carry-out bit, but a mul produces a whole upper half because it's equivalent to shifts + adds. Look at [x86's `mul` instruction](http://www.felixcloutier.com/x86/MUL.html), or the full-multiply instruction on any other architecture. – Peter Cordes Jul 25 '16 at 02:54
  • @jww: updated my answer with a link to wikipedia for an example of binary multiplication. This source of confusion wasn't at all clear in your question. It sounded like you knew how `_mm_clmulepi64_si128` worked, and what the selector was for. – Peter Cordes Jul 25 '16 at 03:07
  • You were right... Once I stopped working on [real data produced by GCM's multiply](http://conradoplg.cryptoland.net/files/2010/12/gcm14.pdf) (hacking in an occasional NULL vector to observe results), the patterns dropped out and it became a mindless operation. – jww Jul 25 '16 at 06:33
  • @jww: Thanks, that paper has some nice background on what the instructions do (Fig1). Apparently NEON `vmull.p8` does do parallel single-byte (widening) multiplies, but `pmull(2)` is a 64x64->128. I don't understand the `16B` / `8H` organization naming. :/ `vmull.p8` would be useful for the Galois-Field multiplies in Reed-Solomon style error-correction codes (like `par2` uses, for example) But I remember looking at pclmul and figuring it wasn't useful for multiple narrow multiplies. (par2 needs 16b * 16b -> 16b Galois-Field multiplies, normally done with two 256-entry LUTs of 16b results). – Peter Cordes Jul 25 '16 at 07:30
  • @jww: [Last time I wanted to try some ARM code, I just used qemu](http://codegolf.stackexchange.com/a/79212/30206). But thanks for the suggestions. I might consider getting an ARM board to replace my home router / mail server. (Currently a PIII-450 running Debian GNU/Linux on a 13GB IDE hard drive :) – Peter Cordes Jul 25 '16 at 08:08
  • The 64-bit multiplier and polynomial support is bad ass... Benchmarks show GCC code for GMAC went from 12.7 cpb and 90 MB/s (C/C++) down to 1.6 cpb and 670 MB/s (NEON and PMULL{2}). – jww Jul 26 '16 at 19:40
  • *"It's too bad the ARM doc doesn't say any of this; it seems horribly lacking..."* - Lol, you should see the IBM docs for the Power8 in-crypto gear. They make ARM docs look like maximum verbosity. – jww Sep 18 '17 at 11:24

How do I convert _mm_clmulepi64_si128 to vmull_{high}_p64?

Here are results from the sample program below. The conversions are:

  1. _mm_clmulepi64_si128(a, b, 0x00) → vmull_p64(vgetq_lane_u64(a, 0), vgetq_lane_u64(b, 0))

  2. _mm_clmulepi64_si128(a, b, 0x01) → vmull_p64(vgetq_lane_u64(a, 1), vgetq_lane_u64(b, 0))

  3. _mm_clmulepi64_si128(a, b, 0x10) → vmull_p64(vgetq_lane_u64(a, 0), vgetq_lane_u64(b, 1))

  4. _mm_clmulepi64_si128(a, b, 0x11) → vmull_p64(vgetq_lane_u64(a, 1), vgetq_lane_u64(b, 1))
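
For example, the question's 0x10 case (number 3 above), written out with the same GCC-style vector cast as the sample program below. The explicit (poly64_t) casts are only there to satisfy vmull_p64's scalar signature, and the function name is mine:

#include <arm_neon.h>

/* a[63:0] * b[127:64], i.e. _mm_clmulepi64_si128(a, b, 0x10).
   uint64x2_t stands in for __m128i, as in the sample program. */
uint64x2_t clmul_0x10(uint64x2_t a, uint64x2_t b)
{
    return (uint64x2_t)vmull_p64((poly64_t)vgetq_lane_u64(a, 0),   /* a[63:0]   */
                                 (poly64_t)vgetq_lane_u64(b, 1));  /* b[127:64] */
}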

For case (4), _mm_clmulepi64_si128(a, b, 0x11), the following also holds:

  5. _mm_clmulepi64_si128(a, b, 0x11) → vmull_high_p64((poly64x2_t)a, (poly64x2_t)b)

I'm guessing cases (1) through (4) can spill out to memory if you're not careful, because vgetq_lane_u64 returns a scalar (non-vector) type. I'm also guessing case (5) has a propensity to stay in the Q registers because it's operating on vector types.
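
If the spill matters, here's a variant of case (3) along the lines a comment below suggests: vget_low_p64 and vget_high_p64 return poly64x1_t, which lives in a SIMD D register, so the operands never need a round trip through a general-purpose register. A sketch, with my own function name:

#include <arm_neon.h>

/* a[63:0] * b[127:64] again, keeping both operands in SIMD registers.
   vget_lane_p64 only satisfies vmull_p64's scalar signature;
   compilers generally fold it away rather than spilling. */
poly128_t clmul_0x10_vec(poly64x2_t a, poly64x2_t b)
{
    poly64x1_t a_lo = vget_low_p64(a);    /* a[63:0],   in a D register */
    poly64x1_t b_hi = vget_high_p64(b);   /* b[127:64], in a D register */
    return vmull_p64(vget_lane_p64(a_lo, 0), vget_lane_p64(b_hi, 0));
}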


x86_64 and _mm_clmulepi64_si128:

$ ./mul-sse-neon.exe
IS_X86: true
****************************************
clmulepi64(a, b, 0x00)
a[0]: 0x2222222222222222, a[1]: 0x4444444444444444
b[0]: 0x3333333333333333, b[1]: 0x5555555555555555
r[0]: 0x606060606060606, r[1]: 0x606060606060606
****************************************
clmulepi64(a, b, 0x01)
a[0]: 0x2222222222222222, a[1]: 0x4444444444444444
b[0]: 0x3333333333333333, b[1]: 0x5555555555555555
r[0]: 0xc0c0c0c0c0c0c0c, r[1]: 0xc0c0c0c0c0c0c0c
****************************************
clmulepi64(a, b, 0x10)
a[0]: 0x2222222222222222, a[1]: 0x4444444444444444
b[0]: 0x3333333333333333, b[1]: 0x5555555555555555
r[0]: 0xa0a0a0a0a0a0a0a, r[1]: 0xa0a0a0a0a0a0a0a
****************************************
clmulepi64(a, b, 0x11)
a[0]: 0x2222222222222222, a[1]: 0x4444444444444444
b[0]: 0x3333333333333333, b[1]: 0x5555555555555555
r[0]: 0x1414141414141414, r[1]: 0x1414141414141414

ARM64 and vmull_p64:

$ ./mul-sse-neon.exe 
IS_ARM: true
****************************************
vmull_p64(a, b, 0x00)
a[0]: 0x2222222222222222, a[1]: 0x4444444444444444
b[0]: 0x3333333333333333, b[1]: 0x5555555555555555
r[0]: 0x606060606060606, r[1]: 0x606060606060606
****************************************
vmull_p64(a, b, 0x01)
a[0]: 0x2222222222222222, a[1]: 0x4444444444444444
b[0]: 0x3333333333333333, b[1]: 0x5555555555555555
r[0]: 0xc0c0c0c0c0c0c0c, r[1]: 0xc0c0c0c0c0c0c0c
****************************************
vmull_p64(a, b, 0x10)
a[0]: 0x2222222222222222, a[1]: 0x4444444444444444
b[0]: 0x3333333333333333, b[1]: 0x5555555555555555
r[0]: 0xa0a0a0a0a0a0a0a, r[1]: 0xa0a0a0a0a0a0a0a
****************************************
vmull_p64(a, b, 0x11)
a[0]: 0x2222222222222222, a[1]: 0x4444444444444444
b[0]: 0x3333333333333333, b[1]: 0x5555555555555555
r[0]: 0x1414141414141414, r[1]: 0x1414141414141414

The sample program mul-sse-neon.cc:

#define IS_ARM (__arm__ || __arm32__ || __aarch32__ || __arm64__ || __aarch64__)
#define IS_X86 (__i386__ || __i586__ || __i686__ || __amd64__ || __x86_64__)

#if (IS_ARM)
# include <arm_neon.h>
# if defined(__ARM_ACLE) || defined(__GNUC__)
#  include <arm_acle.h>
# endif
#endif

#if (IS_X86)
# include <emmintrin.h>
# if defined(__GNUC__)
#  include <x86intrin.h>
# endif
#endif

#if (IS_ARM)
typedef uint64x2_t word128;
#elif (IS_X86)
typedef __m128i word128;
#else
# error "Need a word128"
#endif

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

void print_val(const word128* value, const char* label);

/* gcc -DNDEBUG -g3 -O0 -march=native mul-sse-neon.cc -o mul-sse-neon.exe */
/* gcc -DNDEBUG -g3 -O0 -march=armv8-a+crc+crypto mul-sse-neon.cc -o mul-sse-neon.exe */

int main(int argc, char* argv[])
{
#if (IS_ARM)
    printf("IS_ARM: true\n");
#elif (IS_X86)
    printf("IS_X86: true\n");
#endif

    word128 a,b, r;
    a[0] = 0x2222222222222222, a[1] = 0x4444444444444444;
    b[0] = 0x3333333333333333, b[1] = 0x5555555555555555;

#if (IS_ARM)

    printf("****************************************\n");
    printf("vmull_p64(a, b, 0x00)\n");
    r = (uint64x2_t)vmull_p64(vgetq_lane_u64(a, 0), vgetq_lane_u64(b,0));
    print_val(&a, "a"); print_val(&b, "b"); print_val(&r, "r");

    printf("****************************************\n");
    printf("vmull_p64(a, b, 0x01)\n");
    r = (uint64x2_t)vmull_p64(vgetq_lane_u64(a, 1), vgetq_lane_u64(b,0));
    print_val(&a, "a"); print_val(&b, "b"); print_val(&r, "r");

    printf("****************************************\n");
    printf("vmull_p64(a, b, 0x10)\n");
    r = (uint64x2_t)vmull_p64(vgetq_lane_u64(a, 0), vgetq_lane_u64(b,1));
    print_val(&a, "a"); print_val(&b, "b"); print_val(&r, "r");

    printf("****************************************\n");
    printf("vmull_p64(a, b, 0x11)\n");
    r = (uint64x2_t)vmull_p64(vgetq_lane_u64(a, 1), vgetq_lane_u64(b,1));
    print_val(&a, "a"); print_val(&b, "b"); print_val(&r, "r");

#elif (IS_X86)

    printf("****************************************\n");
    printf("clmulepi64(a, b, 0x00)\n");
    r = _mm_clmulepi64_si128(a, b, 0x00);
    print_val(&a, "a"); print_val(&b, "b"); print_val(&r, "r");

    printf("****************************************\n");
    printf("clmulepi64(a, b, 0x01)\n");
    r = _mm_clmulepi64_si128(a, b, 0x01);
    print_val(&a, "a"); print_val(&b, "b"); print_val(&r, "r");

    printf("****************************************\n");
    printf("clmulepi64(a, b, 0x10)\n");
    r = _mm_clmulepi64_si128(a, b, 0x10);
    print_val(&a, "a"); print_val(&b, "b"); print_val(&r, "r");

    printf("****************************************\n");
    printf("clmulepi64(a, b, 0x11)\n");
    r = _mm_clmulepi64_si128(a, b, 0x11);
    print_val(&a, "a"); print_val(&b, "b"); print_val(&r, "r");

#endif

    return 0;
}

static const word128 s_v = {0,0};
static const char s_l[] = "";
void print_val(const word128* value, const char* label)
{
    const word128* v = (value ? value : &s_v);
    const char* l = (label ? label : s_l);

#if (IS_ARM)
    printf("%s[0]: 0x%" PRIx64 ", %s[1]: 0x%" PRIx64 "\n", l, (*v)[0], l, (*v)[1]);
#elif (IS_X86)
    printf("%s[0]: 0x%" PRIx64 ", %s[1]: 0x%" PRIx64 "\n", l, (*v)[0], l, (*v)[1]);
#endif
}

The code for vmull_high_p64 is as follows. It always produces the same result because it's always taking the same high words:

printf("****************************************\n");
printf("vmull_p64(a, b)\n");
r = (uint64x2_t)vmull_high_p64((poly64x2_t)a, (poly64x2_t)b);
print_val(&a, "a"); print_val(&b, "b"); print_val(&r, "r");

For completeness, switching the data to:

word128 a,b, r;
a[0] = 0x2222222233333333, a[1] = 0x4444444455555555;
b[0] = 0x6666666677777777, b[1] = 0x8888888899999999;

Produces the following results:

$ ./mul-sse-neon.exe
IS_X86: true
****************************************
clmulepi64(a, b, 0x00)
a[0]: 0x2222222233333333, a[1]: 0x4444444455555555
b[0]: 0x6666666677777777, b[1]: 0x8888888899999999
r[0]: 0xd0d0d0d09090909, r[1]: 0xc0c0c0c08080808
****************************************
clmulepi64(a, b, 0x01)
a[0]: 0x2222222233333333, a[1]: 0x4444444455555555
b[0]: 0x6666666677777777, b[1]: 0x8888888899999999
r[0]: 0x191919191b1b1b1b, r[1]: 0x181818181a1a1a1a
****************************************
clmulepi64(a, b, 0x10)
a[0]: 0x2222222233333333, a[1]: 0x4444444455555555
b[0]: 0x6666666677777777, b[1]: 0x8888888899999999
r[0]: 0x111111111b1b1b1b, r[1]: 0x101010101a1a1a1a
****************************************
clmulepi64(a, b, 0x11)
a[0]: 0x2222222233333333, a[1]: 0x4444444455555555
b[0]: 0x6666666677777777, b[1]: 0x8888888899999999
r[0]: 0x212121212d2d2d2d, r[1]: 0x202020202c2c2c2c

And:

$ ./mul-sse-neon.exe 
IS_ARM: true
****************************************
vmull_p64(a, b, 0x00)
a[0]: 0x2222222233333333, a[1]: 0x4444444455555555
b[0]: 0x6666666677777777, b[1]: 0x8888888899999999
r[0]: 0xd0d0d0d09090909, r[1]: 0xc0c0c0c08080808
****************************************
vmull_p64(a, b, 0x01)
a[0]: 0x2222222233333333, a[1]: 0x4444444455555555
b[0]: 0x6666666677777777, b[1]: 0x8888888899999999
r[0]: 0x191919191b1b1b1b, r[1]: 0x181818181a1a1a1a
****************************************
vmull_p64(a, b, 0x10)
a[0]: 0x2222222233333333, a[1]: 0x4444444455555555
b[0]: 0x6666666677777777, b[1]: 0x8888888899999999
r[0]: 0x111111111b1b1b1b, r[1]: 0x101010101a1a1a1a
****************************************
vmull_p64(a, b, 0x11)
a[0]: 0x2222222233333333, a[1]: 0x4444444455555555
b[0]: 0x6666666677777777, b[1]: 0x8888888899999999
r[0]: 0x212121212d2d2d2d, r[1]: 0x202020202c2c2c2c
jww
  • I'd upvote this if you either verify that compilers manage to make good AArch64 asm from your intrinsics, or replace the `vgetq_lane_u64` stuff with whatever the intrinsic is for `ext.16b swapped_b, b, b, #8` to swap upper and lower halves of `b` on AArch64. I find the "this works but might compile to crap" aspect not very satisfying. – Peter Cordes Jul 26 '16 at 05:10
  • @PeterCordes - Those caveats kinda need to be there for GCC. Also see [How to stop GCC from breaking my NEON intrinsics?](http://stackoverflow.com/q/34901932) I even asked the GCC folks where their *NEON Guide* was in an attempt to find patterns that were easier for GCC to recognize in hopes of getting better code. (There's no guide, btw). – jww Jul 26 '16 at 05:23
  • @PeterCordes - One other thing I've found... Clang generates much better code than GCC. If you have a choice, use Clang over GCC. I'm still trying to determine [how to activate the code paths for Visual Studio](http://stackoverflow.com/q/37244202). Windows 10 is supposed to have the IoT gadget support, but the tools do not appear to have been released yet. – jww Jul 26 '16 at 05:30
  • I've found that for x86, gcc makes code that usually uses the instructions that match your intrinsics, while clang will optimize without regard to using the same shuffles, etc. If you have an optimal asm implementation in mind when writing the intrinsics version, gcc does fine. If not, then yeah clang is great. Oh hmm, now that I read the question you linked: instruction-scheduling doesn't matter much on x86. It's not something I've looked at, because I know gcc doesn't even try (much or at all) to schedule when targeting x86. – Peter Cordes Jul 26 '16 at 05:48
  • _I'm guessing the cases (1) through (4) can spill out into memory_ If you switch from using `vgetq_lane_u64(a, 0)` to `vget_low_p64(a)` it will use the correct vector type `poly64x1_t` which resides in a simd `d` vector and should not spill (although there's a solid chance it avoids the spill even as written) – Steve Cox Aug 11 '20 at 13:05