You're using the vmovdqa instruction, which requires an aligned memory operand, on an unaligned element of the array. Use vmovdqu instead, for both loads and stores. Or better yet, use memory operands in the actual computation instructions (this is only valid in AVX though; in legacy SSE, memory operands of most instructions must be aligned).
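For example (just a sketch in AT&T syntax; (%rax) stands in for whatever address the element ends up at):

    vmovdqa   (%rax), %xmm0      # faults if the address is not 16-byte aligned
    vmovdqu   (%rax), %xmm0      # works for any alignment
    vcvtdq2ps (%rax), %xmm0      # with AVX, an unaligned memory operand can be folded into the computation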
There are other inefficiencies and problems with the assembler block. For instance, as mentioned in the comments, you're missing the clobbers, which indicate the pieces of CPU and memory state that may be modified by the asm block. In your case, you're missing the "memory", "xmm0" and "xmm1" clobbers. Without these, the compiler will assume that the asm block does not affect memory contents (the aResult array in particular) or the xmm registers (and, for example, use those registers for its own purposes, in conflict with your asm block).
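As an illustration, a minimal sketch (with hypothetical src and dst pointers, not your code) of an asm statement that passes plain pointers and therefore needs both the register and the "memory" clobbers:

    /* src and dst are hypothetical int pointers, shown only to illustrate the clobber list */
    __asm__ ("vmovdqu (%0), %%xmm0;"
             "vpaddd  %%xmm0, %%xmm0, %%xmm0;"
             "vmovdqu %%xmm0, (%1);"
             : /* no outputs */
             : "r"(src), "r"(dst)
             : "memory", "xmm0");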
Also, you seem to have messed up the input and output registers in the addps and divps instructions, as you are overwriting or not using the results of the previous instructions in a few instances. In the AT&T x86 asm syntax used by gcc, the last operand is the output operand. You should normally use the AVX version of every instruction when using any AVX instructions; mixing 128-bit AVX instructions with legacy SSE won't cause SSE/AVX transition stalls if the upper halves of the YMM registers were already clean (e.g. after vzeroupper), so using vaddps / vdivps here is optional but recommended.
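To illustrate the operand order (AT&T syntax):

    addps  %xmm1, %xmm0          # SSE: xmm0 = xmm0 + xmm1  (destination is the last operand)
    vaddps %xmm1, %xmm0, %xmm2   # AVX: xmm2 = xmm0 + xmm1  (three operands, non-destructive)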
Also, you're passing references to the input and output arrays inefficiently. Instead of passing pointers to the specific elements of the arrays, it is more efficient to pass memory references to those elements, which allows the compiler to use more elaborate memory addressing expressions than a plain pointer. This also removes the need to calculate the pointer in a separate instruction before your asm block. And it is more efficient to pass the three constant in an xmm register instead of memory. Ideally, you would want to move the vbroadcastss out of the loop, but that is only possible with intrinsics (or by writing the whole loop inside one asm statement).
The corrected and improved asm statement would look like this:
__asm__ __volatile__
(
    "vcvtdq2ps %1, %%xmm0;"           // xmm0 = (float)cyclic[i..i+3]
    "vcvtdq2ps %2, %%xmm1;"           // xmm1 = (float)cyclic[j..j+3]
    "vaddps %%xmm1, %%xmm0, %%xmm0;"  // xmm0 += xmm1
    "vcvtdq2ps %3, %%xmm1;"           // xmm1 = (float)cyclic[k..k+3]
    "vaddps %%xmm1, %%xmm0, %%xmm0;"  // xmm0 += xmm1
    "vbroadcastss %4, %%xmm1;"        // xmm1 = {three, three, three, three}
    "vdivps %%xmm0, %%xmm1, %%xmm1;"  // xmm1 = three / sum
    "vcvtps2dq %%xmm1, %%xmm1;"       // convert back to integers
    "vmovdqu %%xmm1, %0;"             // store to aResult[i..i+3]
    : "=m"(aResult[i])
    : "m"(cyclic[i]), "m"(cyclic[j]), "m"(cyclic[k]), "x"(three)
    : "memory", "xmm0", "xmm1"
);
(It actually doesn't have to be volatile anymore, now that the memory output is an explicit operand.)
But a better solution yet would be to use intrinsics to implement this asm block. Not only would this make the code safer, it would also make it more efficient, as it would enable additional compiler optimizations. Of course, this is only possible if your compiler supports intrinsics.
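A rough sketch with intrinsics might look like this (I'm assuming, based on the asm operands above, that cyclic and aResult are arrays of 32-bit integers and three is a float; adjust for the actual types):

    #include <immintrin.h>

    /* Assumed types: int cyclic[], int aResult[], float three. */
    __m128 sum = _mm_add_ps(
        _mm_add_ps(_mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)&cyclic[i])),
                   _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)&cyclic[j]))),
        _mm_cvtepi32_ps(_mm_loadu_si128((const __m128i *)&cyclic[k])));
    __m128 res = _mm_div_ps(_mm_set1_ps(three), sum);  /* the broadcast can be hoisted out of the loop */
    _mm_storeu_si128((__m128i *)&aResult[i], _mm_cvtps_epi32(res));

Compiled with -mavx, gcc emits the VEX-encoded forms of these instructions, and it can hoist the broadcast of three out of the loop on its own.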