C++ Inline asm idiv modulo

Question

I am writing a modulo function and want to optimize the number of instructions called. Currently it looks like this

#include <cstdlib>
constexpr long long mod = 1e9 + 7;
static __attribute__((always_inline)) long long modulo(long long x) noexcept
{
    return lldiv(x, mod).rem;
}

and even though I am compiling with -static there is a call to lldiv that isn't being inlined. I want to get around this by adding inline assembly to my function to call the idiv instruction directly. How can I do this?

I am using g++ 9.4.0 and I'm targeting x86. Since this is related to a competetive programming contest, the code will be compiled with g++ -std=c++17 -O3 -static.

For what architecture? It's normal that library functions aren't inlined, although GCC does have some common ones as builtins like sqrt. Any reason you can't just `return x % mod` like a normal person? Compilers will expand that inline, except for types wider than a register where GCC tends to call a helper function (e.g. for `long long` on a 32-bit target like i386) — Peter Cordes, Aug 21 '22 at 21:19
"warning: 'always_inline' function might not be inlinable [-Wattributes]" gcc gives a warning that your code may not be inlined. Not all library functions can be inlined. — jpr42, Aug 21 '22 at 21:21
@PeterCordes I'm targeting x86. The modulo operator with -O3 expands to a lot of instructions, and I want to have as few as possible. — Adam Soltan, Aug 21 '22 at 21:22
@jpr42 How my function is being inlined is irrelevant to this question. ( want to replace the call to lldiv with inline assembly. — Adam Soltan, Aug 21 '22 at 21:23
https://godbolt.org/z/173TEjhs5 shows x86-64 output using `-Os` (minimize code size at the expense of speed, especially for integer division by constants, using cqo+idiv) vs. `-O2` / `-O3` where it uses a [fast multiplicative inverse](https://stackoverflow.com/questions/41183935/why-does-gcc-use-multiplication-by-a-strange-number-in-implementing-integer-divi) costing a few extra instructions, especially for signed integers. But *much* faster, [especially on Intel before Ice Lake for 64-bit operand-size](https://stackoverflow.com/questions/56655383/can-128bit-64bit-hardware-unsigned). — Peter Cordes, Aug 21 '22 at 21:24
@PeterCordes, since this is in a competetive programming setting, I don't have control over the compiler options. Also I want O3 for the rest of the program. I don't care which is faster, just which one has fewer instructions since the contest I want to use this for counts instructions, not time. — Adam Soltan, Aug 21 '22 at 21:26
What kind of competition is it where fewer but slower instructions is better, even when the program is compiled with `-O3` not `-Os`? `-O3` will spend extra code size all over the place to gain speed, e.g. auto-vectorizing. It is of course possible to use `cqo`/`idiv` from inline asm, you just normally don't want to. See [What is the difference between 'asm', '\_\_asm' and '\_\_asm\_\_'?](https://stackoverflow.com/posts/comments/59576185) for idiv examples; I guess you'd want to `__int128 dividend = x;` and take the low/high halves of that. — Peter Cordes, Aug 21 '22 at 21:28
Regardless of whether or not this is a good way of judging programs, it is what it is and I don't have control over it. — Adam Soltan, Aug 21 '22 at 21:29
What I'm asking is exactly what criteria your program is being judged on. I want to understand why it makes sense to use small but slow division. Is your program overall being judged on code-size, not speed, despite the compile options? — Peter Cordes, Aug 21 '22 at 21:30
It is being judged on the number of instructions called during execution of the whole program. My solution is heavily reliant on modulo, since `a % b` is about 10 instructions, shortening that down to 4 or 5 will give a huge "speed" boost. — Adam Soltan, Aug 21 '22 at 21:32
Ok, so a silly metric and fixed compiler options that don't optimize for it. You may be able to use `#pragma GCC optimize("Os")` to optimize for the metric you're being judged on. https://godbolt.org/z/Ge4zrzjGq shows it gets what you want for `x % mod`. Although other parts of your code might benefit from fully unrolling loops; `-Os` optimizes for static code size, not dynamic instruction count. And auto-vectorization can get more work done with fewer instructions if there are any loops that can vectorize. (`__attribute__((target("avx2")))` perhaps) — Peter Cordes, Aug 21 '22 at 21:39
Any way to change the optimization setting just for one function? (I'm already using a few target pragmas like avx2) — Adam Soltan, Aug 21 '22 at 21:41
Right, of course you'd want to use 64-bit operand-size `idiv`, so you widen the input types to 64-bit. Maybe put a `cqo` inside the asm template so you only have to provide a `"+a"(x)` in/out operand, not `(int64_t)( static_cast<__int128>(x) >> 64)` if the compiler doesn't reliably use `cqo` for that to generate an RDX input; RDX is an output anyway so no harm in writing it in the asm template, except for defeating constant propagation. — Peter Cordes, Aug 21 '22 at 21:43
Yeah, `__attribute__((optimize("Os")))` as mentioned by the docs for the pragma (https://gcc.gnu.org/onlinedocs/gcc/Function-Specific-Option-Pragmas.html / https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html). But that probably doesn't work when inlining into a function with a different optimization level. GCC won't inline at all when ISA options mismatch (like an AVX2 function can't inline into a non-avx2 function, since the optimizer wouldn't know how to keep the AVX2 instructions contained to certain code paths, not e.g. hoisting some out of ifs) — Peter Cordes, Aug 21 '22 at 21:46
The % operator in C is a remainder operator that returns 0 or a number with the same sign as the dividend. In math, a modulo operator returns a 0 or a number with the same sign as the divisor. If the goal is a mathematical modulo operator (which appears to be the case based on the constant 1e9 + 7), then if the remainder is not zero and doesn't have the same sign as the divisor, add the divisor to the remainder to get the mathematical modulo. — rcgldr, Aug 22 '22 at 02:18

C++ Inline asm idiv modulo

0 Answers0