As Eric wrote in a comment, you should pay attention to the carry. After your loop is done, you may have some carry from the highest position. If in the end carry is not zero, then it has to be one. Then its value is 2^256
, corresponding to index 8
. Since
2^256 ≡ 2^32 + 977 (mod P)
you may account for this carry by adding 2^32 + 977
to your result so far. You can probably do so in an optimized manner (i.e. not re-using the same add loop), since you know the one term to be mostly zeros so you can stop after the first (least significant) two “digits” are added as soon as the carry has become zero. (I'm using the term “digit” for each of your u32
array members.)
What do you do if during that addition the carry at the high end of the addition is non-zero a second time? As Eric noted, when each of your inputs is less than P
, the sum will be less than 2P
so subtracting P
once (which is what the shift from 2^256
to 2^32 + 977
does) will make it less than P
. So no need to worry, you can stop the loop when carry becomes zero no matter the digit count.
And what if the resulting sum is bigger than P
but less than 2^256
so you don't get any carry? To also cover this situation, you can compare the result against P
, and subtract P
unless it's smaller. Subtraction is a lot easier than division. You can skip this check if you did the code path for the non-zero carry. You can also optimize that check somewhat, aborting if any of the first 6 “digits” is less than 2^32-1
. Only if they all equal 2^32-1
then you can do some minor comparisons and computations to do the actual subtraction in the lowest two “digits” before clearing all the higher “digits”.
In Python-like pseudo-code and glossing over the details of how to detect overflow or underflow happening in the line before:
def fe_add(x, y, res):
carry = 0
for i in 0 .. 7:
res[i] = x[i] + y[i] + carry
carry = 1 if overflow else 0
# So far this is what you had.
if carry != 0:
# If carry == 1: add 2^32 + 977 instead.
res[0] += 977
res[1] += 1 + (1 if overflow else 0)
carry = 1 if overflow else 0
i = 2
while carry != 0:
res[i] += 1
carry = 1 if overflow else 0
i++
else:
# Compare res against P.
for i in 7 .. 2:
if res[i] != 2^32 - 1:
return
if res[1] == 2^32 - 1 or (res[1] == 2^32 - 2 and res[0] >= 2^32 - 977):
# If res >= P, subtract P.
res[0] -= 2^32 - 977
res[1] -= 2^32 - 2 + (1 if underflow else 0)
for i in 2 .. 7:
res[i] = 0
There is an alternative. Instead of using numbers from the range [0 .. P-1]
to represent your elements of the modulo group, you might also choose to use [2^32 + 977 .. 2^256-1]
instead. That would simplify some operations but complicate others. Additions in particular would be simpler, because just handling the nonzero carry situation discussed above would be enough. Comparing whether a number is ≡ 0 (mod P)
would be more complicated, for example. And it might also be confusing some code contributors. As usual with changes that might improve performance, tests would be best suited to tell whether one or the other solution performs better in practice. But perhaps you might want to design your API so that you can swap these implementation details without any code using them even noticing it. This could mean e.g. not relying on zero initialization to initialize a zero element of that data type but having a function instead.