Fastest way to get a positive modulo in C/C++

Question

Often in my inner loops I need to index an array in a "wrap-around" way, so that (for example) if the array size is 100 and my code asks for element -2, it should be given element 98. In many high level languages such as Python, one can do this simply with my_array[index % array_size], but for some reason C's integer arithmetic (usually) rounds toward zero instead of consistently rounding down, and consequently its modulo operator returns a negative result when given a negative first argument.

Often I know that index will not be less than -array_size, and in these cases I just do my_array[(index + array_size) % array_size]. However, sometimes this can't be guaranteed, and for those cases I would like to know the fastest way to implement an always-positive modulo function. There are several "clever" ways to do it without branching, such as

inline int positive_modulo(int i, int n) {
    return (n + (i % n)) % n;
}

or

inline int positive_modulo(int i, int n) {
    return (i % n) + (n * (i < 0));
}

Of course I can profile these to find out which is the fastest on my system, but I can't help worrying that I might have missed a better one, or that what's fast on my machine might be slow on a different one.

So is there a standard way to do this, or some clever trick that I've missed that's likely to be the fastest possible way?

Also, I know it's probably wishful thinking, but if there's a way of doing this that can be auto-vectorised, that would be amazing.

@Mysticial also if the solution constrains the number I'm modding over to be a power of 2, that's OK. — N. Virgo, Feb 21 '13 at 08:00
Then, you'll want to either hard-code the modulus, or put it in as a compile-time constant. You'll get much better performance that way than whatever tricks you can play with the sign. — Mysticial, Feb 21 '13 at 08:00
Well, modding over a power of two is trivial; you just do `& (n-1)` regardless of sign. — nneonneo, Feb 21 '13 at 08:01
@Nathaniel Yeah, generally division/modulus is very slow. But if it's a compile-time constant, the compiler can usually find ways around it. I'll skip those details since they can be dirty. You'll probably get enough performance gain just from that. The sign issue should come later if it still matters. — Mysticial, Feb 21 '13 at 08:03
For some lovely examples of compiler optimizations, try compiling just `int main(int c, char**v) { return c % N; }` for your favourite `N` and view the generated assembly. Lots of magic numbers :) — nneonneo, Feb 21 '13 at 08:04
possible duplicate of [How to code a modulo (%) operator in C/C++/Obj-C that handles negative numbers](http://stackoverflow.com/questions/4003232/how-to-code-a-modulo-operator-in-c-c-obj-c-that-handles-negative-numbers) — TemplateRex, Feb 21 '13 at 09:14
@rhalbersma I hadn't seen that question (thanks!), but I would say it isn't a duplicate, because the emphasis here is on an efficient solution, rather than just one that works. Most of the answers to the other question involve explicit branching, so they will be slow. — N. Virgo, Feb 21 '13 at 10:30
Like @Mysticial suggested: `template< typename T, T N > T modulo(T value) { return (value % N + N) % N; }` — aggsol, Feb 21 '13 at 11:34
_If_ there is a most efficient way of doing it, for a _compile time constant_ modulo, I'd leave the job to the compiler. If not, until _measurements_ show that this is performance critical, I'd just write the clearest, simplest code possible. BTW, be careful, `a % m` returns negative for negative `a` on same architectures. — vonbrand, Feb 22 '13 at 02:03
@vonbrand I know that `a % m` usually returns a negative value for negative `a` - the question is about how to guarantee a positive result while *also* being efficient. — N. Virgo, Feb 22 '13 at 03:02
@Nathaniel, that isn't so. The `%` operator does whatever the hardware does, Some always return correct `r = a mod b` in the range `0 <= r < abs(b)`, others return negative values if `a` or `b` or both are negative. — vonbrand, Feb 22 '13 at 03:07
@vonbrand yes, that's exactly what I said. `a % m` usually returns a negative value for negative `a`, but I want to guarantee that I will get a positive one. That's the whole point of the question. — N. Virgo, Feb 22 '13 at 03:09
@vonbrand also, FYI, in C++11 it's not machine dependent - it's actually part of the standard that `a % b` should return negative results in this case. — N. Virgo, Feb 22 '13 at 03:11
I'm surprised no one pointed this out, but in C % isn't modulus, it returns the remainder. Even fmod returns the remainder if you look at the documentation: http://www.cplusplus.com/reference/cmath/fmod/ So I think it's weird to call this positive modulus, since the behavior you're looking for is what modulus is supposed to be: http://en.wikipedia.org/wiki/Modular_arithmetic — leetNightshade, Mar 21 '14 at 00:08
With `(i % n) + (n * (i < 0))` I'm seeing result `n` instead of `0` on negative exact multiples, e.g (-3, 3) -> 3. — Randall Whitman, May 16 '19 at 21:47
@Nathaniel "Often I know that index will not be less than -array_size, and in these cases I just do my_array[(index + array_size) % array_size]. However, sometimes this can't be guaranteed" , why it not guaranteed if you are sure that index will never be less than -array_size? and array_size is always positive number. it must work — ChauhanTs, Aug 18 '19 at 12:49
@ChauhanTs I mean, in some applications I am sure this will be the case, but in others I am not. — N. Virgo, Aug 18 '19 at 14:32
It's a shame fmod(n,d) with positive d and non-zero n wasn't specified as returning a value that's at least -d/2 but less than than +d/2. That would make it a proper mod-reduction function (i.e. guarantee that it would yield the same value for two values of n that were congruent mod d) while also ensuring that it would always yield an exact value (which wouldn't be possible if the interval were [0..d). — supercat, Jan 16 '22 at 23:56

Martin B · Answer 1 · 2013-02-21T08:50:18.273

93

The standard way I learned is

inline int positive_modulo(int i, int n) {
    return (i % n + n) % n;
}

This function is essentially your first variant without the abs (which, in fact, makes it return the wrong result). I wouldn't be surprised if an optimizing compiler could recognize this pattern and compile it to machine code that computes an "unsigned modulo".

Edit:

Moving on to your second variant: First of all, it contains a bug, too -- the n < 0 should be i < 0.

This variant may not look as if it branches, but on a lot of architectures, the i < 0 will compile into a conditional jump. In any case, it will be at least as fast to replace (n * (i < 0)) with i < 0? n: 0, which avoids the multiplication; in addition, it's "cleaner" because it avoids reinterpreting the bool as an int.

As to which of these two variants is faster, that probably depends on the compiler and processor architecture -- time the two variants and see. I don't think there's a faster way than either of these two variants, though.

edited Feb 21 '13 at 08:50

answered Feb 21 '13 at 08:13

Martin B

23,670
6
53
72

Nitpick: It actually won't vectorize because there's generally no SIMD support for modulus. – Mysticial Feb 21 '13 at 08:46
1

Would it be more efficient to factor the `n` out into a template? In the case that the function cannot be inlined, the compiler may be able to play some tricks to improve performance. – Alex Chamberlain Feb 21 '13 at 09:02
Oops, you're right about the abs(), I've edited it out of my question. – N. Virgo Feb 21 '13 at 10:32
also corrected the typo in the second example. (I really should have tested them first.) – N. Virgo Feb 22 '13 at 03:03
1

Notice that for (-3 mod 3) using `(i % n) + (n * (i < 0))` or `(i % n) + (i < 0 ? n : 0)`, the result is 3: `(-3 % 3) == 0` and `(3 * (-3 < 0)) == 3`, probably not the desired result. – Qwertie Jun 20 '13 at 22:57
One problem is if n is negative, then the module becomes negative... This instead seem to work: `return (i % n + std::abs(n)) % n;` – Antonio Mar 08 '17 at 15:14
`i % n + n` may overflow leading to incorrect functionality (UB). – chux - Reinstate Monica Aug 18 '19 at 11:38

Jorge Bellon · Accepted Answer · 2020-08-03T16:49:21.680

33

Most of the time, compilers are very good at optimizing your code, so it is usually best to keep your code readable (for both compilers and other developers to know what you are doing).

Since your array size is always positive, I suggest you to define the quotient as unsigned. The compiler will optimize small if/else blocks into conditional instructions which have no branches:

unsigned modulo( int value, unsigned m) {
    int mod = value % (int)m;
    if (mod < 0) {
        mod += m;
    }
    return mod;
}

This creates a very small function without branches:

modulo(int, unsigned int):
        mov     eax, edi
        cdq
        idiv    esi
        add     esi, edx
        mov     eax, edx
        test    edx, edx
        cmovs   eax, esi
        ret

For example modulo(-5, 7) returns 2.

Unfortunately, since the quotient is not known they must perform an integer division, which is a bit slow compared to other integer operations. If you know the sizes of your array are power of two, I recommend keeping these function definitions in a header, so that the compiler can optimize them into a more efficient function. Here is the function unsigned modulo256(int v) { return modulo(v,256); }:

modulo256(int):                          # @modulo256(int)
        mov     edx, edi
        sar     edx, 31
        shr     edx, 24
        lea     eax, [rdi+rdx]
        movzx   eax, al
        sub     eax, edx
        lea     edx, [rax+256]
        test    eax, eax
        cmovs   eax, edx
        ret

See assembly: https://gcc.godbolt.org/z/DG7jMw

See comparison with most voted answer: http://quick-bench.com/oJbVwLr9G5HJb0oRaYpQOCec4E4

Edit: turns out Clang is able to generate a function without any conditional move instructions (which cost more than regular arithmetic operations). This difference is completely negligible in the general case due to the fact that the integral division takes around 70% of the total time.

Basically, Clang shifts value right to extend its sign bit to the whole width of m (that is 0xffffffff when negative and 0 otherwise) which is used to mask the second operand in mod + m.

unsigned modulo (int value, unsigned m) {
    int mod = value % (int)m;
    m &= mod >> std::numeric_limits<int>::digits;
    return mod + m;
}

edited Aug 03 '20 at 16:49

answered Sep 26 '19 at 14:19

Jorge Bellon

2,901
15
25

Thank you, this is very interesting. Also interesting that specifying 29 gives some saving over the generic function, even if a power of 2 is even faster. I ran the benchmark on g++ also, with similar results. I'm accepting this answer because I think it does actually supersede the information in the other, higher-voted answers. – N. Virgo Sep 26 '19 at 15:04
I did the number 29 on purpose (it is constant and also a prime number, so it does not work as fine as a power of two). Compilers are very smart as I told you, and may replace a division by a known integer divisor with a multiplication and a shift which is a bit faster. – Jorge Bellon Sep 26 '19 at 15:37
I realise that - I was just commenting that it was interesting. – N. Virgo Sep 26 '19 at 17:13
1

If you want to know the exact methodology for this, there are books/websites that will give you more information about this : for example the PowerPC Compiler Writer's Guide has a section on this at pages 52 to 61, and Matt Godbolt talked about this in his "What has my compiler done for me lately ?" talk, at the 35th minute – Gabriel Ravier Nov 10 '19 at 20:26
If you check my updated [quickbench results](http://quick-bench.com/9D_wxhAg7IC4bbfAMQRy4klvBww) You'll see that [my answer](https://stackoverflow.com/questions/14997165/fastest-way-to-get-a-positive-modulo-in-c-c/61469098#61469098) is faster on GCC in all cases, and intentionally avoids a conditional move. It's the same on Clang for the generic case, because clang happens to generate the same assembly. – Kyle Butt Apr 27 '20 at 23:13
1

Thanks. I've updated the answer to include why not using conditional moves is faster, even though I only see improvements (with GCC) for the constant division and not for the generic case. – Jorge Bellon May 01 '20 at 10:58
There was mistake in my original code that made it faster. That has been fixed, and it is now equivalent with a conditional move in the generic case. I can't edit the comment. – Kyle Butt May 01 '20 at 18:05
2

This code is incorrect. It won't work for modulo(-x, x) and returns x in such a case. – Jan Schultke Aug 03 '20 at 12:03
1

You have to righshift mod instead, not value. – Jan Schultke Aug 03 '20 at 12:56
The `modulo256` compiled by both gcc and clang is actually not very efficient, considering it can be implemented as simply `value & 255`. – interjay Mar 30 '21 at 21:52
@Jorge Doing `& 255` gives exactly what OP asked for (the positive modulo), and is equivalent to what `modulo256` does here (assuming 2's complement) but faster. – interjay Dec 21 '21 at 11:37

score 31 · Answer 3 · answered Feb 21 '13 at 08:02

31

Modulo a power of two, the following works (assuming twos complement representation):

return i & (n-1);

answered Feb 21 '13 at 08:02

nneonneo

171,345
36
312
383

Many thanks! I will leave the question open in case someone has a good answer for the general case, but I will probably end up using this. – N. Virgo Feb 21 '13 at 08:03
I don't understand this solution, could you please explain? For example, if we have 7 mod 2 -> 0111 mod 0010 -> 0110 & 0010 = 2 and it should be 1. What am I missing? – ixSci Feb 21 '13 at 08:39
You use `n-1`, not `n`. So in this case, it would be `0111 & 1 = 1`. Note that when `n` is a power of two, `n-1` consists of all ones. – nneonneo Feb 21 '13 at 08:40
1

what is `n` here? `n mod i` or `i mod n`? – ixSci Feb 21 '13 at 08:42
1

As simple as the answer is, I would be very careful. Remember different architectures generally store negative numbers in different ways. Hence bitwise operators on negative numbers cant differ with different compilers and/or architectures. – mity Feb 21 '13 at 08:42
@ixSci: `n` is the modulus. – nneonneo Feb 21 '13 at 08:49
@nneonneo, sorry my ignorance but I don't know how to apply term "modulus" to `a mod b` where `a` is dividend, `b` - divisor and mod is modulus operator. Just give me an answer in letter, please. What is `n` in the aforementioned `a mod b`, `a` or `b`? – ixSci Feb 21 '13 at 08:59
2

`i mod n` == `i & (n-1)` when `n` is a power of two and `mod` is the aforementioned positive mod. (FYI: `modulus` is the common mathematical term for the "divisor" when a modulo operation is considered). – nneonneo Feb 21 '13 at 09:01
7

@GrijeshChauhan: The limitations are clearly stated: `n` must be a power of two and numbers must use twos-complement (pretty much every computer produced in the last 20 years). When else will it fail? – nneonneo Feb 21 '13 at 09:03

jthill · Answer 4 · 2020-08-03T18:59:54.263

9

An old-school way to get the optional addend using twos-complement sign-bit propagation:

int positive_mod(int i, int m)
{
    /* constexpr */ int shift = CHAR_BIT*sizeof i - 1;
    int r = i%m;
    return r+ (r>>shift & m);
}

edited Aug 03 '20 at 18:59

answered Feb 22 '13 at 04:46

jthill

55,082
5
77
137

Old-school hard to read hack. I like it. Though I wonder if `(i>>shift & n)` might be faster as the bitshift operation will otherwise have to wait for the modulo operation to finish. – aaaaaaaaaaaa Feb 22 '13 at 22:47
It would be faster but it would give incorrect results for e.g. -2 mod 2. – jthill Feb 23 '13 at 00:28
Shoot, you are right. And now that you mention it, that is true for `(i % n) + (n * (i < 0))` as well. – aaaaaaaaaaaa Feb 23 '13 at 10:32
Assuming CHAR_BIT is a global contest (of the system?) sizeof of what? I do not understand if it is CHAR_BIT*(sizeof(i)) -1or – roschach Aug 31 '18 at 15:30
It's not old school. It's a common trick used if you need constant time operations. You can use the trick for a complete mod function: ```#include template T modulo(T value, T m) { size_t shift_width = std::numeric_limits::digits - 1; T mod = (value % m); T mask = (value >> shift_width) ^ (m >> shift_width); mod += (mask & m); return mod; }``` The results are what you want: modulo(1, 3) = 1 modulo(-1, 3) = 2 modulo(-1, -3) = -1 modulo(1, -3) = -2 Sorry about the formatting. – Kyle Butt May 01 '20 at 19:31
@J.Schultke What led you to say that? As a bare assertion it's simply false. – jthill Aug 03 '20 at 15:24
@jthill nevermind, this is what happens for the accepted answer because in the accepted answer `i` would be rightshifted instead. Your version is correct and does not have the same issue, but I didn't pay close enough attention to it and just assumed it's an identical solution. – Jan Schultke Aug 03 '20 at 17:36
1

@J.Schultke Okay, I changed some names anyway to fix possible confusion, now `m` is the modulus and `r` is the result, no what-is-this-number-really `n`'s left. – jthill Aug 03 '20 at 19:02

chux - Reinstate Monica · Answer 5 · 2022-12-23T17:16:07.393

Fastest way to get a positive modulo in C/C++

The following fast? - maybe not as fast as others, yet is simple and functionally correct for all¹ a,b -- unlike others.

int modulo_Euclidean(int a, int b) {
  int m = a % b;
  if (m < 0) {
    // m += (b < 0) ? -b : b; // Avoid this form: -b is UB when b == INT_MIN
    m = (b < 0) ? m - b : m + b;
  }
  return m;
}

[Edit 2022]

From here, added tests to handle INT_MIN mod -1 and detect mod 0.

int modulo_Euclidean2(int a, int b) {
  if (b == 0) TBD_Code(); // Perhaps return -1 to indicate failure?
  if (b == -1) return 0; // This test needed to prevent UB of `INT_MIN % -1`.
  int m = a % b;
  if (m < 0) {
    // m += (b < 0) ? -b : b; // Avoid this form: it is UB when b == INT_MIN
    m = (b < 0) ? m - b : m + b;
  }
  return m;
}

Various other answers have mod(a,b) weaknesses especially when b < 0.

See Euclidean division for ideas about b < 0

inline int positive_modulo(int i, int n) {
    return (i % n + n) % n;
}

Fails when i % n + n overflows (think large i, n) - Undefined behavior.

return i & (n-1);

Relies on n as a power of two. (Fair that the answer does mention this.)

int positive_mod(int i, int n)
{
    /* constexpr */ int shift = CHAR_BIT*sizeof i - 1;
    int m = i%n;
    return m+ (m>>shift & n);
}

Often fails when n < 0. e, g, positive_mod(-2,-3) --> -5

int32_t positive_modulo(int32_t number, int32_t modulo) {
    return (number + ((int64_t)modulo << 32)) % modulo;
}

Obliges using 2 integer widths. (Fair that the answer does mention this.)
Fails with modulo < 0. positive_modulo(2, -3) --> -1.

inline int positive_modulo(int i, int n) {
    int tmp = i % n;
    return tmp ? i >= 0 ? tmp : tmp + n : 0;
}

Often fails when n < 0. e, g, positive_modulo(-2,-3) --> -5

¹ Exceptions: In C, a%b is not defined when a/b overflows as in a/0 or INT_MIN/-1.

@cassepipe `+=` is fine, but `-b` when `b == INT_MAN` is UB. Note added to answer. — chux - Reinstate Monica, Jun 30 '21 at 18:11
@chux-ReinstateMonica Thanks ! Indeed INT_MIN has no positive equivalent in the int range as it would be above INT_MAX by one. And this is because of https://en.wikipedia.org/wiki/Two%27s_complement (Putting that there for beginners such as myself not so long ago) — cassepipe, Jun 30 '21 at 18:21

score 4 · Answer 6 · answered Oct 09 '18 at 16:49

4

If you can afford to promote to a larger type (and do your modulo on the larger type), this code does a single modulo and no if:

int32_t positive_modulo(int32_t number, int32_t modulo) {
    return (number + ((int64_t)modulo << 32)) % modulo;
}

answered Oct 09 '18 at 16:49

user15006

41
1

Kyle Butt · Answer 7 · 2020-08-07T15:40:52.867

4

If you want to avoid all conditional paths (including the conditional move generated above, (For example if you need this code to vectorize, or to run in constant time), You can use the sign bit as a mask:

unsigned modulo(int value, unsigned m) {
  int shift_width = sizeof(int) * 8 - 1;
  int tweak = (value >> shift_width);
  int mod = ((value - tweak) % (int) m) + tweak;
  mod += (tweak & m);
  return mod;
}

Here are the quickbench results You can see that on gcc it's better in the generic case. For clang it's the same speed in the generic case, because clang generates the branch free code in the generic case. The technique is useful regardless, because the compiler can't always be relied on to produce the particular optimization, and you may have to roll it by hand for vector code.

edited Aug 07 '20 at 15:40

answered Apr 27 '20 at 21:53

Kyle Butt

9,340
3
22
15

1

I know the OP doesn't need constant time, as it's for an array lookup, but this has been linked to as the fast way to compute modulo, which *someone* may need to do in constant time, so I figured it was worth mentioning. – Kyle Butt Apr 27 '20 at 21:54
1

Your godbolt link has a mistake because you are performing unsigned division instead of signed (you are missing the cast). – Jorge Bellon Apr 30 '20 at 17:25
Intel doesn't currently support integer divison as a vector unit, and neither does Arm, but they aren't the only CPUs with vector units, and they may get integer division in the future. – Kyle Butt Apr 30 '20 at 17:53
1

I've given a small look and the quick bench results show the same performance when `m` is not a constant (just ran your link with clearing the cached results). GCC reports the same assembly if you code it like `m &= value < 0? UINT_MAX : 0u; mod += m;` which is much more readable than using the shift right (the right shift is just adding an all 1s bitmask when the sign bit is set). The fact that Clang does the thing right proves even further than letting the compiler do the dirty work is usually a good idea. – Jorge Bellon May 01 '20 at 10:35
If you need it to run in constant time, it's a bad idea to rely on the compiler. – Kyle Butt May 01 '20 at 18:17
Shifting an expression of signed type and negative value is sadly implementation-defined (see [here](https://stackoverflow.com/questions/7622/are-the-shift-operators-arithmetic-or-logical-in-c)). Moreover, it is more portable to use `CHAR_BIT` (although almost all modern platforms set it to 8). This being said, this code should work many common platforms. – Jérôme Richard Jul 31 '20 at 19:07
@JérômeRichard Yes, it assumes a Two's complement machine. While not in the standard, it's a reasonable assumption. – Kyle Butt Aug 04 '20 at 18:31

Aki Suihkonen · Answer 8 · 2013-02-21T12:42:11.157

You can as well do array[(i+array_size*N) % array_size], where N is large enough integer to guarantee positive argument, but small enough for not to overflow.

When the array_size is constant, there are techniques to calculate the modulus without division. Besides of power of two approach, one can calculate a weighted sum of bitgroups multiplied by the 2^i % n, where i is the least significant bit in each group:

e.g. 32-bit integer 0xaabbccdd % 100 = dd + cc*[2]56 + bb*[655]36 + aa*[167772]16, having the maximum range of (1+56+36+16)*255 = 27795. With repeated applications and different subdivision one can reduce the operation to few conditional subtractions.

Common practises also include approximation of division with reciprocal of 2^32 / n, which usually can handle reasonably large range of arguments.

 i - ((i * 655)>>16)*100; // (gives 100*n % 100 == 100 requiring adjusting...)

SkYWAGz · Answer 9 · 2015-09-16T16:20:22.893

0

Your second example is better than the first. A multiplication is a more complex operation than an if/else operation, so use this:

inline int positive_modulo(int i, int n) {
    int tmp = i % n;
    return tmp ? i >= 0 ? tmp : tmp + n : 0;
}

edited Sep 16 '15 at 16:20

answered Sep 16 '15 at 15:10

SkYWAGz

172
1
4
13

1) you're right, I edited the code. 2) if i is negative the return is a negative, i%n returns a negative number, for example -102%100 returns -2 so u just add n to the result – SkYWAGz Sep 16 '15 at 15:58
1) Perhaps simply `return tmp < 0 ? tmp + n : tmp;`. 2) This answer has an advantage over [highly rated one](http://stackoverflow.com/a/14997413/2410359) in that it never overflows. – chux - Reinstate Monica Sep 16 '15 at 16:26
Re-state as "it" was unclear: This answer never overflows. (advantage) (`if n > 0`). The [other answer](http://stackoverflow.com/a/14997413/2410359) may overflow. (weakness). – chux - Reinstate Monica Sep 16 '15 at 17:08

Fastest way to get a positive modulo in C/C++

9 Answers9

Linked

Related