uint32_t * uint32_t = uint64_t vector multiplication with gcc

Question

I'm trying to multiply vectors of uint32_t producing the full 64-bit result in an uint64_t vector in gcc. The result I expect is for gcc to emit a single VPMULUDQ instruction. But what gcc outputs as code is horrible shuffling around of the individual uint32_t of the source vectors and then a full 64*64=64 multiplication. Here is what I've tried:

#include <stdint.h>

typedef uint32_t v8lu __attribute__ ((vector_size (32)));
typedef uint64_t v4llu __attribute__ ((vector_size (32)));

v4llu mul(v8lu x, v8lu y) {
    x[1] = 0; x[3] = 0; x[5] = 0; x[7] = 0;
    y[1] = 0; y[3] = 0; y[5] = 0; y[7] = 0;
    return (v4llu)x * (v4llu)y;
}

The first masks out the unwanted parts of the uint32_t vector in the hope that gcc would optimize away the unneeded parts of the 64*64=64 multiplication and then see the masking is pointless as well. No such luck.

v4llu mul2(v8lu x, v8lu y) {
    v4llu tx = {x[0], x[2], x[4], x[6]};
    v4llu ty = {y[0], y[2], y[4], y[6]};
    return tx * ty;
}

Here I try to create a uint64_t vector from scratch with only the used parts set. Again gcc should see the top 32 bits of each uint64_t are 0 and not do a full 64*64=64 multiply. Instead, a lot of extracting and putting back of the values happens, and a 64*64=64 multiply.

v4llu mul3(v8lu x, v8lu y) {
    v4llu t = {x[0] * (uint64_t)y[0], x[2] * (uint64_t)y[2], x[4] * (uint64_t)y[4], x[6] * (uint64_t)y[6]};
    return t;
}

Let's build the result vector by multiplying the parts. Maybe gcc sees that it can use VPMULUDQ to achieve exactly that. No luck, it falls back to 4 IMUL opcodes.

Is there a way to tell gcc what I want it to do (32*32=64 multiplication with everything perfectly placed)?

Note: Inline asm or the intrinsic isn't the answer. Writing the opcode by hand obviously works. But then I would have to write different versions of the code for many target architectures and feature sets. I want gcc to understand the problem and produce the right solution from a single source code.

Are you looking for `v4di __builtin_ia32_pmuludq256 (v8si,v8si)` — Ben, Nov 13 '19 at 13:16
@JL2210: The type promotion rules are not pertinent. The question does not ask for a standard C way to do this. It asks for GCC features. — Eric Postpischil, Nov 13 '19 at 13:21
If you just want to know how to make GCC do what you want, why not use the intrinsic that @Ben proposed? It seems fragile to rely on creating some pattern of code that the version of GCC that you're using right now happens to recognize and emit the code that you want. If you want to *know* it will work, use the intrinsic function that explicitly specifies your intent. — Jason R, Nov 13 '19 at 13:35
@GoswinvonBrederlow: **Why** is the intrinsic not the answer? If it does what you want, why not use it? — Eric Postpischil, Nov 13 '19 at 14:07
@EricPostpischil: As I understand it, with this extension GCC (very reasonably) follows the C practice that the result of the arithmetic operation is the (possibly promoted) operand type. If you want 32x32->64 you'd need to promote one to 64 before the operation and rely on it being optimized correctly. — R.. GitHub STOP HELPING ICE, Nov 13 '19 at 14:14
`mul` and `mul2` are optimized fine with clang: https://godbolt.org/z/d3MAay, `mul3` is not equivalent, since it needs to truncate the results to 32 bits. I guess your options are: a) Use clang, b) use intrinsics, c) provide a patch to gcc which properly optimizes this (or file a bug and hope someone else fixes it). — chtz, Nov 13 '19 at 14:53
@Ben: the standard portable intrinsic is [`_mm256_mul_epu32`](https://www.felixcloutier.com/x86/pmuludq), defined by `immintrin.h` — Peter Cordes, Nov 13 '19 at 14:59
@chtz Added the missing cast to 64bit to mul3. Doesn't make gcc use the pmuludq. — Goswin von Brederlow, Nov 13 '19 at 16:19
@EricPostpischil Because if I wanted to use the intrinsic I would have done so. The goal is to get the compiler to produce the right opcode for the -m specified. The intrinsic will fail to compile if -mavx2 isn't used. — Goswin von Brederlow, Nov 13 '19 at 16:30
@GoswinvonBrederlow: You can test `__AVX2__` with `#if` and use the intrinsic if it is `__AVX2__` is non-zero and other code if it is not. — Eric Postpischil, Nov 14 '19 at 14:05
@EricPostpischil and `__MMX__` and `__SSE__` and `__SSE2__` and `__SSE3__` and `__SSE4__` and `__NEON__` and `__NEON2__` and some 30 other. As said that is not what I want. — Goswin von Brederlow, Nov 14 '19 at 14:08
@GoswinvonBrederlow: “not what I want” and “if I wanted to use the intrinsic I would have done so” are not justifiable reasons. “Because we need to support many different target architectures and writing individual code for each is too costly” is. Edit the question to state your full requirements, based on actual project requirements, not on “wants.” — Eric Postpischil, Nov 14 '19 at 14:13

score 2 · Answer 1 · answered Nov 14 '19 at 14:12

As noted in the comments by chtz both mul1 and mul2 are optimized right by clang. Code similar to mul3 but using a for loop will be optimized too (but not as well).

So to me it looks like the syntax is correct to express what the code should do and gcc simply lacks the smarts so far to optimize this properly.

uint32_t * uint32_t = uint64_t vector multiplication with gcc

1 Answers1