Optimize modulus arithmetic with big numbers

Question

I'd like to implement big number arithmetic operations modulo P, with P = 2^256 - 2^32 - 977. Note that P is fixed so any optimization can be hardcoded.

I'm using an array of 8 u32 to represent a u256:

struct fe {
   uint32_t b[8]; // 256 = 8 x 32
};

Now a simple version of the addition could look like this

void fe_add(struct fe *x, struct fe *y, struct fe *res) {
    int carry = 0;

    for (int i = 0; i < 8; ++i) {
        uint32_t tmp = x->b[i] + y->b[i] + carry;
        carry = tmp < x->b[i] && tmp < y->b[i];
        res->b[i] = tmp;        
    }
}

Now to support (x + y) % P, I could use this definition and define -, *, and / over struct fe:

// (x + y) % P = (x + y) - (P * int((x + y) / P))

fe_add(&x, &y, &t1); // t1 = x + y
fe_div(&t1, &P, &t2); // t2 = (x + y) / P
fe_mult(&P, &t2, &t3); // t3 = P * ((x + y) / P)
fe_sub(&t1, &t3, &res); // res = x + y - (P * ((x + y) / P))

What would be a better way to implement (x + y) % P directly during the addition, knowing that P won't change?

First, `fe_add` is incorrect since it loses the carry out of the high word. If the carry is non-zero, it should be reduced modulo P (yielding 2^32+977) and added into the result. Second, `fe_add` could reduce the sum modulo P simply by testing whether it is greater than or equal to P and, if so, subtracting P. (Adding two non-negative numbers less than P can only produce a number less than 2P, so it is only ever necessary to subtract P once. Performing a division and multiplication is unnecessary, as the quotient can only be 0 or 1.) — Eric Postpischil, Jun 21 '21 at 10:52
Given that P is so close to 2^256, one might consider letting the reduction of the carry suffice and allowing some results to exceed P (but be less than 2^256). Then the other arithmetic routines would be designed to accommodate this. — Eric Postpischil, Jun 21 '21 at 10:53
Note that [`carry = tmp < x->b[i];`](https://stackoverflow.com/questions/33948450/c-detect-unsigned-int-overflow-of-addition) suffices; you do not need to test with both operands. — Eric Postpischil, Jun 21 '21 at 10:58
That makes total sense, thank you > _Then the other arithmetic routines would be designed to accommodate this._ if I let results exceed P, how about storing the carry somewhere in the struct and not deal with P at all in all the routines, but only have a normalize function at the end that does exactly what you said (in pseudo code): `if (carry) { result += 2^32 + 977 }` to add 2^256 - P `if (result >= P) { result -= P }` to substract P — Ervadac, Jun 21 '21 at 11:38

MvG · Accepted Answer · 2021-06-21T13:21:23.563

As Eric wrote in a comment, you should pay attention to the carry. After your loop is done, you may have some carry from the highest position. If in the end carry is not zero, then it has to be one. Then its value is 2^256, corresponding to index 8. Since

2^256 ≡ 2^32 + 977 (mod P)

you may account for this carry by adding 2^32 + 977 to your result so far. You can probably do so in an optimized manner (i.e. not re-using the same add loop), since you know the one term to be mostly zeros so you can stop after the first (least significant) two “digits” are added as soon as the carry has become zero. (I'm using the term “digit” for each of your u32 array members.)

What do you do if during that addition the carry at the high end of the addition is non-zero a second time? As Eric noted, when each of your inputs is less than P, the sum will be less than 2P so subtracting P once (which is what the shift from 2^256 to 2^32 + 977 does) will make it less than P. So no need to worry, you can stop the loop when carry becomes zero no matter the digit count.

And what if the resulting sum is bigger than P but less than 2^256 so you don't get any carry? To also cover this situation, you can compare the result against P, and subtract P unless it's smaller. Subtraction is a lot easier than division. You can skip this check if you did the code path for the non-zero carry. You can also optimize that check somewhat, aborting if any of the first 6 “digits” is less than 2^32-1. Only if they all equal 2^32-1 then you can do some minor comparisons and computations to do the actual subtraction in the lowest two “digits” before clearing all the higher “digits”.

In Python-like pseudo-code and glossing over the details of how to detect overflow or underflow happening in the line before:

def fe_add(x, y, res):
  carry = 0
  for i in 0 .. 7:
    res[i] = x[i] + y[i] + carry
    carry = 1 if overflow else 0
  # So far this is what you had.

  if carry != 0:
    # If carry == 1: add 2^32 + 977 instead.
    res[0] += 977
    res[1] += 1 + (1 if overflow else 0)
    carry = 1 if overflow else 0
    i = 2
    while carry != 0:
      res[i] += 1
      carry = 1 if overflow else 0
      i++

  else:
    # Compare res against P.
    for i in 7 .. 2:
      if res[i] != 2^32 - 1:
        return
    if res[1] == 2^32 - 1 or (res[1] == 2^32 - 2 and res[0] >= 2^32 - 977):
      # If res >= P, subtract P.
      res[0] -= 2^32 - 977
      res[1] -= 2^32 - 2 + (1 if underflow else 0)
      for i in 2 .. 7:
        res[i] = 0

There is an alternative. Instead of using numbers from the range [0 .. P-1] to represent your elements of the modulo group, you might also choose to use [2^32 + 977 .. 2^256-1] instead. That would simplify some operations but complicate others. Additions in particular would be simpler, because just handling the nonzero carry situation discussed above would be enough. Comparing whether a number is ≡ 0 (mod P) would be more complicated, for example. And it might also be confusing some code contributors. As usual with changes that might improve performance, tests would be best suited to tell whether one or the other solution performs better in practice. But perhaps you might want to design your API so that you can swap these implementation details without any code using them even noticing it. This could mean e.g. not relying on zero initialization to initialize a zero element of that data type but having a function instead.

hey, thanks for taking the time and the tips. I actually ended up managing the carry and the mod part in a different function so I can do multiple additions / substractions easily, the carry is stored in the struct and so `2^32 + 977` can be added multiple times. Not sure yet if it has a major benefit + the carry need to be managed as well. FYI, i found an implementation with the same P (in the bitcoin sources) using 10xu26 instead of 8xu32 and this seems to be more optimized dealing with the carry, especially for the multiplication. — Ervadac, Jun 23 '21 at 21:44

Optimize modulus arithmetic with big numbers

1 Answers1