How can I efficiently compute 2¹²⁸ % n, where n is a non-zero uint64_t?
If I had access to a narrowing division, such as MSVC's _udiv128 intrinsic, I could do:
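uint64_t remainder;
// remainder = 2^64 % n; needs n > 1, since for n == 1 the quotient 2^64 overflows 64 bits
_udiv128(1, 0, n, &remainder);
// remainder = (remainder * 2^64) % n = 2^128 % n; safe because remainder < n
_udiv128(remainder, 0, n, &remainder);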
But that is two slow division instructions, and some architectures, such as ARM, have no narrowing division instruction at all. Is there a better way?
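For reference, here is a portable baseline that avoids the narrowing division entirely. It is only a sketch of the obvious approach, not a claim about the fastest one: the name pow2_128_mod is made up for illustration, and it assumes nothing beyond standard C with 64-bit unsigned arithmetic. It takes 2⁶⁴ % n from the identity 2⁶⁴ ≡ 2⁶⁴ − n (mod n), then doubles the result modulo n 64 times.

```c
#include <stdint.h>

/* Hypothetical helper: returns 2^128 % n for n != 0,
   using only 64-bit operations (no 128-bit types, no narrowing division). */
static uint64_t pow2_128_mod(uint64_t n)
{
    /* 2^64 - n wraps to (0 - n) in uint64_t arithmetic,
       and (2^64 - n) % n == 2^64 % n. */
    uint64_t r = (0 - n) % n;

    /* Double r modulo n, 64 times: afterwards r == 2^128 % n. */
    for (int i = 0; i < 64; ++i) {
        /* Invariant: r < n, so d = n - r is positive.
           2r >= n exactly when r >= d; in that case 2r - n == r - d.
           Otherwise 2r < n, so r + r cannot overflow. */
        uint64_t d = n - r;
        r = (r >= d) ? r - d : r + r;
    }
    return r;
}
```

For example, pow2_128_mod(10) gives 6, the last decimal digit of 2¹²⁸. This costs one 64-bit division plus 64 loop iterations, so it is a correctness baseline more than an answer to the question. On compilers that provide unsigned __int128 (GCC and Clang on 64-bit targets), the loop could presumably be replaced by (unsigned __int128)r * r % n, though that typically goes through a library routine for the 128-bit modulo.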
Related: How to compute 2⁶⁴/n in C?