3

I'm confused about how one might supposedly bypass the final subtraction of the modulus in radix-2 montgomery modular multiplication, when used in a modular exponentiation algorithm. The following two papers put forward the conditions for bypassing the subtraction.

I don't understand exactly what is required in terms of the "preprocessing and postprocessing" to eliminate the need for the repetitive subtraction of the modulus at the end of the montgomery multiplication.

My understanding after reading the above papers is that, to eliminate the final subtractions, you must:

  1. Zero-extend each input operand to the modular exponentiation by two

    e.g. new[2049 downto 0]  = (0, 0, old[2047 downto 0]) 
    
  2. Increase the loop bound inside the Montgomery multiplications by two, such that two more iterations of the loop are executed

I've made these modifications to a working algorithm, however the results are not as I expect and I do not understand why. Therefore, I assume I am misinterpreting something in these papers, or leaving out a critical step.

Let us refer to my (working) radix-2 montgomery modular exponentiation function in (C-like pseudocode). Note that I have extended the operand width by two most-significant zero digits (just to make sure I'm not overflowing). They used to only be 2048 bits.

let NUM_BITS = 2048
let rsaSize_t be a 2050-bit vector type

// Montgomery multiplication: outData = XYr^(-1) modulo M,     
// where the radix r=2^n    (n=NUM_BITS) 
function montMult( rsaSize_t X,       // Multiplier
                   rsaSize_t Y,       // Multiplicand
                   rsaSize_t M,       // Modulus
                   rsaSize_t outData) // Result
{
    rsaSize_t S = 0;  // Running sum

    for (i=0; i<NUM_BITS; i++)
    {
        if (X.bit(i)==1) // Check ith bit of X
            S += Y;

        if (S.bit(0)==1) // check LSB of S
            S += M;

        S = S >> 1;   // Rightshift 1 bit
    }

    // HERE IS THE FINAL SUBTRACTION I WANT (NEED) TO AVOID
    if (S >= M)
    {
        S -= M;
    }

    outData = S.range(NUM_BITS-1,0);
}


//  montgomery modular exponentiation using square and multiply algorithm
//  computes  M^e modulo n, where we precompute the transformation of the 
//  base and running-partial sum into the montgomery domain 
function rsaModExp( rsaSize_t e,     // exponent 
                    rsaSize_t n,     // modulus
                    rsaSize_t Mbar,  // precomputed: montgomery residue of the base w.r.t. the radix--> (2^2048)*base mod n 
                    rsaSize_t xbar,  // precomputed: montgomery residue of 1  w.r.t. the radix--> 2^2048 mod n                 
                    rsaSize_t *out)  // result
{
    for (i=NUM_BITS-1; i>=0; i--)
    {
        montMult(xbar, xbar, n, xbar); // square
        if (e.bit(i)==1) 
            montMult(Mbar, xbar, n, xbar); // multiply
    }

    // undo montgomery transform
    montMult(xbar, 1, n, out);
}

Am I missing something in the papers? I do not believe this is an implementation error, as my code matches exactly what is put forth in the papers. I believe that I might be a conceptual error. Any and all help appreciated.

Thanks!

asmvolatile
  • 522
  • 5
  • 22

1 Answers1

2

Not sure what's wrong with your non-working implementation (if I understood well, what you show is a working one). In order to use the Walter optimization to compute M^e mod n, if your numbers all fit in 2048 bits, you need:

let NUM_BITS = 2050            // 2048 + 2
n < 2^(NUM_BITS - 2)           // precondition
M < 2 * n                      // precondition
let R = 2^(2 * NUM_BITS) mod n // pre-computed once for all
let M' = montMult(M, R, n)     // bring M in Montgomery form
let C' = montExp(M', e, n)     // Montgomery exponentiation
let C = montMult(C', 1, n)     // bring C' in normal form

Most important: do not forget to check the preconditions.

The Montgomery multiplications comprise NUM_BITS (2050 in your case) iterations (if-A-bit-set-add-B, if-odd-add-n, div-by-two), least significant bit first, and all numbers are represented on NUM_BITS (2050 in your case) bits.

The Montgomery exponentiation also comprises NUM_BITS (2050 in your case) iterations (square, if-e-bit-set-mult), most significant bit first, and all numbers are represented on NUM_BITS (2050 in your case) bits. Hope it helps.

Renaud Pacalet
  • 25,260
  • 3
  • 34
  • 51
  • That was it. The problem was that I was computing R (or R^2 in the standard montgomery terminology) using 2048, yet iterating over 2050. I actually ended up trying 2050 for NUM_BITS and everything worked! Thank you. One additional question....in the higher radix case, where each operand is split into words, should NUM_BITS still be the same, since the total operand width does not change? – asmvolatile Aug 14 '17 at 19:38
  • for example, I'd like to break up the 2048-bit operands into 17-bit words, since the FPGA has 17-bit wide multiply accumulators like this (https://www.jstage.jst.go.jp/article/ijnc/1/2/1_277/_pdf). Would the walter optimization still hold if I instead run the loop for an additional 17-bits (additional word) and have R = 2^(2*NUM_BITS) mod n, where NUM_BITS = 17bits * 121words ? I ask if it should still hold, because this change breaks my implementation :) – asmvolatile Aug 14 '17 at 21:34
  • The Walter optimization should work even if you use more than 2 extra bits. I suspect that your problem is related to a wrong propagation of the carry bits or some other similar bug. Cross-check, maybe, that your addition still works, before moving up to modular multiplication and then exponentiation. By the way, the DSP48E1 blocks in Xilinx FPGAs contain a 48 bits adder, not 17 bits (see the [datasheet](https://www.xilinx.com/support/documentation/user_guides/ug479_7Series_DSP48E1.pdf))). – Renaud Pacalet Aug 16 '17 at 06:14
  • And as you will be using an FPGA, which main advantage over micro-processors is its potential level of parallelism, did you consider the possibility to use ultra-wide (2050 bits) Carry Save Adders instead of DSP blocks? If you have enough resource available (mainly D-Flip-Flops), this would allow you to compute each iteration of your modular multiplication in one clock cycle only, at about the same clock frequency... – Renaud Pacalet Aug 16 '17 at 06:19
  • **1)** The modular multiplication works, I'm just not able to get modExp to work, for both the normal method, and the walter method which is very odd. I've tried running it for 2045-2056 iterations, and none produces a correct result. But I've verified the MM works. And, yes I meant utilizing the 18-bit input to the multiply accumulator. **2)** I haven't tried explicitly using the CSA, but I have tried an implementation using entirely ap_uint<2048> values, and either utilization is too large, or it fails timing (WNS of ~-30ns). – asmvolatile Aug 16 '17 at 17:49
  • I can't really help you on the debugging part. DSP48E1 blocks can compute a **48** bits add (not 17 or 18). So, if all you want to use is their add function, it would really be better to use it completely (same resource utilization, same clock frequency). CSA are completely different from carry ripple. Their propagation delay does **not** depend on the size of the operands. They are ultra-fast and can really be used in your context, you should really have a look at them. – Renaud Pacalet Aug 17 '17 at 08:37
  • Thank you for the help, I will look into binding the HLS operations to the CSAs – asmvolatile Aug 17 '17 at 16:46