Strange behaviour of clang assembler

Question

I tried to compile this overflow detection macro of Zend engine:

#define ZEND_SIGNED_MULTIPLY_LONG(a, b, lval, dval, usedval) do {   \
    long __tmpvar;                                                  \
    __asm__( \
        "mul %0, %2, %3\n"                                      \
        "smulh %1, %2, %3\n"                                        \
        "sub %1, %1, %0, asr #63\n"                                 \
            : "=X"(__tmpvar), "=X"(usedval)                         \
            : "X"(a), "X"(b));                                      \
    if (usedval) (dval) = (double) (a) * (double) (b);              \
    else (lval) = __tmpvar;                                         \
} while (0)

And got this result in assembly:

; InlineAsm Start
mul     x8, x8, x9
smulh   x9, x8, x9
sub x9, x9, x8, asr #63

; InlineAsm End

The compiler used only 2 register for both input and output of the macro, which i think it must be at least 3, and lead to wrong result of the calculation (for example, -1 * -1). Any suggestion?

Sebastian Redl · Accepted Answer · 2016-04-21T08:50:00.897

The assembly code is buggy. From GCC's documentation on extended asm:

Use the ‘&’ constraint modifier (see Modifiers) on all output operands that must not overlap an input. Otherwise, GCC may allocate the output operand in the same register as an unrelated input operand, on the assumption that the assembler code consumes its inputs before producing outputs. This assumption may be false if the assembler code actually consists of more than one instruction.

This basically says that from the moment you write to an output parameter not marked with an ampersand, you're not allowed to use the input parameters anymore because they might have been overwritten.

almost-right answer, wrong reason. "input parameter used twice" is nonsense. It's actually needed on all output params written before the last input param is read. — Peter Cordes, Apr 21 '16 at 08:45

score 3 · Answer 2 · edited May 23 '17 at 12:19

The syntax is designed around the concept of wrapping a single insn which reads its inputs before writing its outputs.

When you use multiple insns, you often need to use an early-clobber modifier on the constraint ("=&x") to let the compiler know you write an output or read-write register before reading all the inputs. Then it will make sure that output register isn't the same register as any of the input registers.

See also the x86 tag wiki, and my collection of inline asm docs and SO answers at the bottom of this answer.

#define ZEND_SIGNED_MULTIPLY_LONG(a, b, lval, dval, usedval) do {   \
    long __tmpvar;                                                  \
    __asm__( \
        "mul   %[tmp], %[a], %[b]\n\t"                              \
        "smulh %[uv], %[a], %[b]\n\t"                               \
        "sub   %[uv], %[uv], %[tmp], asr #63\n"                     \
            : [tmp] "=&X"(__tmpvar), [uv] "=&X"(usedval)            \
            : [a] "X"(a), [b] "X"(b));                              \
    if (usedval) (dval) = (double) (a) * (double) (b);              \
    else (lval) = __tmpvar;                                         \
} while (0)

Do you really need all those instructions to be inside the inline asm? Can't you make long tmp = a * b an input operand? Then if the compiler needs a*b elsewhere in the function, CSE can see it.

You can convince gcc to broadcast the sign bit with an arithmetic right shift using a ternary operator. So hopefully you can coax the compiler to do the sub that way. Then it could use subs to set flags from the sub instead of needing a separate test insn on usedval.

If you can't get your target compiler to make the code you want, then sure, give inline asm a shot. But beware, I've seen clang be a lot worse than gcc with inline asm. It tends to make worse code around the inline asm on x86.

Thank you @Peter Cordes! Main reason is that due to the function is defined in latest php source, i dont want to get messy with it at all. The macro is inside an #if that only for aarch64. — Sunary, Apr 21 '16 at 08:54
@Sunary. I don't think `long __tmpvar = (long)a*(long)b` is "messy", unless `a` and `b` have different types, or that somehow doesn't compile well on AArch64. — Peter Cordes, Apr 21 '16 at 08:57

Strange behaviour of clang assembler

2 Answers2