RSA hardware implementation: radix-2 montgomery multiplication issues

Question

I'm implementing RSA 1024 in hardware (xilinx ZYNQ FPGA), and am unable to figure out a few curious issues. Most notably, I am finding that my implementation only works for certain base/exponent/modulus combinations, but have not found any reason why this is the case.

Note: I am implementing the algorithm using Xilinx HLS (essentially C code that is synthesized into hardware). For the sake of this post, treat it just like a standard C implementation, except that I can have variables up to 4096 bits wide. I haven't yet parallelized it, so it should behave just like standard C code.

The Problem

My problem is that I am able to get the correct answer for certain modular exponentiation test problems, but only if the values for the base, exponent, and modulus can be written in much fewer bits than the actual 1024 bit operand width (i.e. they are zero padded).

When I use actual 1024-bit values generated from SSH-keygen, I no longer get the correct results.

For example, if my input arguments are

uint1024_t base     = 1570
uint1024_t exponent = 1019
uint1024_t modulus  = 3337

I correctly get a result of 1570^1029 mod(3337) = 688

However, when I actually use values that occupy all (or approximately all) 1024 bits for the inputs...

uint1024_t base     = 0x00be5416af9696937b7234421f7256f78dba8001c80a5fdecdb4ed761f2b7f955946ec920399f23ce9627f66286239d3f20e7a46df185946c6c8482e227b9ce172dd518202381706ed0f91b53c5436f233dec27e8cb46c4478f0398d2c254021a7c21596b30f77e9886e2fd2a081cadd3faf83c86bfdd6e9daad12559f8d2747
uint1024_t exponent = 0x6f1e6ab386677cdc86a18f24f42073b328847724fbbd293eee9cdec29ac4dfe953a4256d7e6b9abee426db3b4ddc367a9fcf68ff168a7000d3a7fa8b9d9064ef4f271865045925660fab620fad0aeb58f946e33bdff6968f4c29ac62bd08cf53cb8be2116f2c339465a64fd02517f2bafca72c9f3ca5bbf96b24c1345eb936d1
uint1024_t modulus  = 0xb4d92132b03210f62e52129ae31ef25e03c2dd734a7235efd36bad80c28885f3a9ee1ab626c30072bb3fd9906bf89a259ffd9d5fd75f87a30d75178b9579b257b5dca13ca7546866ad9f2db0072d59335fb128b7295412dd5c43df2c4f2d2f9c1d59d2bb444e6dac1d9cef27190a97aae7030c5c004c5aea3cf99afe89b86d6d

I incorrectly get a massive number, rather than the correct answer of 29 (0x1D)

I've checked both algorithms a million times over, and have experimented with different initial values and loop bounds, but nothing seems to work.

My Implementation

I am using the standard square and multiply method for the modular exponentiation, and I chose to use the Tenca-Koc radix-2 algorithm for the montgomery multiplication, detailed in pseudocode below...

/* Tenca-Koc radix2 montgomery multiplication */
Z = 0
for i = 0 to n-1
    Z = Z + X[i]*Y
    if Z is odd then Z = Z + M
    Z = Z/2  // left shift in radix2
if (S >= M) then S = S - M

My Montgomery multiplication implementation is as follows:

void montMult(uint1024_t X, uint1024_t Y, uint1024_t M, uint1024_t* outData)
{
    ap_uint<2*NUM_BITS> S = 0; 

    for (int i=0; i<NUM_BITS; i++)
    {
        // add product of X.get_bit(i) and Y to partial sum
        S += X[i]*Y; 

        // if S is even, add modulus to partial sum
        if (S.test(0))  
            S += M;     

        // rightshift 1 bit (divide by 2)
        S = S >> 1;
    }

    // bring back to under 1024 bits by subtracting modulus
    if (S >= M)
        S -= M;

    // write output data
    *outData = S.range(NUM_BITS-1,0);

}

and my top-level modular exponentiation is as follows, where (switching notation!) ...

// k: number of bits
// r = 2^k (radix)
// M: base
// e: exponent
// n: modulus
// Mbar: (precomputed residue) M*r mod(n)
// xbar: (precomputed initial residue) 1*r mod(n)

void ModExp(uint1024_t M, uint1024_t e, uint1024_t n, 
            uint1024_t Mbar, uint1024_t xbar, uint1024_t* out)
{
    for (int i=NUM_BITS-1; i>=0; i--)
    {
        // square
        montMult(xbar,xbar,n,&xbar);

        // multiply   
        if (e.test(i)) // if (e.bit(i) == 1)
            montMult(Mbar,xbar,n,&xbar);
    }
        // undo montgomery residue transformation
        montMult(xbar,1,n,out);
}

I can't for the life of me figure out why this works for everything except an actual 1024 bit value. Any help would be much appreciated

Hmm, interesting but it will still mean debugging I guess. I'd strongly advice you to find another implementation using the same Montgomery ladder and try and find the intermediate values. — Maarten Bodewes, Jan 11 '17 at 01:12
I've already tried that and the values are mostly consistent, which is why I think its something conceptual. But its hard to say, since the values are so large that I don't have much intuition as to what should be right or wrong (and verifying values is often impossible given 1024 bit intermediate numbers) — asmvolatile, Jan 11 '17 at 01:34
Looks like the correct answer to your test example is 29 (not 25). I can't debug your code, I can only check correctness of Montgomery multiplication & exponentiation data. Do you always get correct results with small numbers? — kludg, Jan 11 '17 at 05:34
@kludg sorry, that was just a typo on my part, it is indeed 29. Yes, I always get correct values with small numbers — asmvolatile, Jan 11 '17 at 13:40
@j.p. where do you mean? They should be able to grow beyond 1024 because S (the partial sum) is 2*NUM_BITS long (2048, in this case), and the subtraction of the modulus at the end brings it back to under 1024 bits — asmvolatile, Jan 11 '17 at 13:42
In short: your `montMult` code is correct, your `ModExp` code is wrong. — kludg, Jan 11 '17 at 18:51
@kludg thanks for that test info, I will check it. Again, I really appreciate your help! — asmvolatile, Jan 11 '17 at 23:06
@kludg looks like I was correct in assuming the error was in montMult. I get a different value. I still don't understand how I can get the correct answer for both even and odd exponents though :O — asmvolatile, Jan 11 '17 at 23:13
Sorry, I was wrong - too late for me. See my updated answer. The algorithm is correct, I don't know what is wrong with your code. — kludg, Jan 11 '17 at 23:44
@kludg if you don't mind, could you link me to your BigInteger library? I'd like to compile your Delphi and step through it with a debugger to check intermediate values. I understand if you don't want to share it, but in case it is hosted online, I'd love to look at it. — asmvolatile, Jan 12 '17 at 00:14

kludg · Answer 1 · 2017-01-12T01:00:20.073

I've replaced my answer because I was wrong. Your original code is perfectly correct. I've tested it using my own BigInteger library, which includes Montgomery arithmetic, and everything works like a charm. Here is my code:

const
  base1     =
 '0x00be5416af9696937b7234421f7256f78dba8001c80a5fdecdb4ed761f2b7f955946ec9203'+
 '99f23ce9627f66286239d3f20e7a46df185946c6c8482e227b9ce172dd518202381706ed0f91'+
 'b53c5436f233dec27e8cb46c4478f0398d2c254021a7c21596b30f77e9886e2fd2a081cadd3f'+
 'af83c86bfdd6e9daad12559f8d2747';
  exponent1 =
 '0x6f1e6ab386677cdc86a18f24f42073b328847724fbbd293eee9cdec29ac4dfe953a4256d7e'+
 '6b9abee426db3b4ddc367a9fcf68ff168a7000d3a7fa8b9d9064ef4f271865045925660fab62'+
 '0fad0aeb58f946e33bdff6968f4c29ac62bd08cf53cb8be2116f2c339465a64fd02517f2bafc'+
 'a72c9f3ca5bbf96b24c1345eb936d1';
  modulus1  =
 '0xb4d92132b03210f62e52129ae31ef25e03c2dd734a7235efd36bad80c28885f3a9ee1ab626'+
 'c30072bb3fd9906bf89a259ffd9d5fd75f87a30d75178b9579b257b5dca13ca7546866ad9f2d'+
 'b0072d59335fb128b7295412dd5c43df2c4f2d2f9c1d59d2bb444e6dac1d9cef27190a97aae7'+
 '030c5c004c5aea3cf99afe89b86d6d';

function MontMult(X, Y, N: BigInteger): BigInteger;
var
  I: Integer;
begin
  Result:= 0;
  for I:= 0 to 1023 do begin
    if not X.IsEven then Result:= Result + Y;
    if not Result.IsEven then Result:= Result + N;
    Result:= Result shr 1;
    X:= X shr 1;
  end;
  if Result >= N then Result:= Result - N;
end;

function ModExp(B, E, N: BigInteger): BigInteger;
var
  R, MontB: BigInteger;
  I: Integer;

begin
  R:= BigInteger.PowerOfTwo(1024) mod N;
  MontB:= (B * R) mod N;
  for I:= 1023 downto 0 do begin
    R:= MontMult(R, R, N);
    if not (E shr I).IsEven then
      R:= MontMult(MontB, R, N);
  end;
  Result:= MontMult(R, 1, N);
end;

procedure TestMontMult;
var
  Base, Expo, Modulus: BigInteger;
  MontBase, MontExpo: BigInteger;
  X, Y, R: BigInteger;
  Mont: TMont;

begin
// convert to BigInteger
  Base:= BigInteger.Parse(base1);
  Expo:= BigInteger.Parse(exponent1);
  Modulus:= BigInteger.Parse(modulus1);

  R:= BigInteger.PowerOfTwo(1024) mod Modulus;
// Convert into Montgomery form
  MontBase:= (Base * R) mod Modulus;
  MontExpo:= (Expo * R) mod Modulus;
  Writeln;

// MontMult test, all 3 versions output
//  '0x146005377258684F3FFD8D9A70D723BDD3A2E3A160E11B7AD35A7106D4D903AB9D14A9201'+
//  'D0907CE2FC2E04A69656C38CE64AA0BADF2376AEFB19D8732CE2B3650466E31BB78CF24F4E3'+
//  '774A78575738B668DA0E40C8DDDA972CE101E0CADC5D4CCFF6EF2E4E97AF02F34E3AB7258A7'+
//  '323E472FC051825FFC72ADC53B0DAF3C4';
  Writeln('Using MontMult');
  Writeln(MontMult(MontMult(MontBase, MontExpo, Modulus), 1, Modulus).ToHexString);
// same using TMont instance
  Writeln('Using TMont.Multiply');
  Mont:= TMont.GetInstance(Modulus);
  Writeln(Mont.Reduce(Mont.Multiply(MontBase, MontExpo)).ToHexString);
  Writeln('Using TMont.ModMul');
  Writeln(Mont.ModMul(Base,Expo).ToHexString);

// ModExp test, all 3 versions output 29
  Writeln('Using ModExp');
  Writeln(ModExp(Base, Expo, Modulus).ToString);
  Writeln('Using BigInteger.ModPow');
  Writeln(BigInteger.ModPow(Base, Expo, Modulus).ToString);
  Writeln('Using TMont.ModPow');
  Writeln(Mont.ModPow(Base, Expo).ToString);
end;

I appreciate the response! Unfortunately, I don't think that was the issue. I created a new local variable `xbar_temp` and changed the second call to montMult from `montMult(Mbar,xbar,n, &xbar)` to `montMult(Mbar,xbar,n &xbar_temp` and then on the next line store the value back in `xbar`. I also then ultimately write the result after the last multiplication `montMult(xbar_temp,1,n, &out), essentially duplicating your code, but this doesn't change the resultant value. And also, if my modExp function was wrong, then I shouldn't be getting the correct answer for lower bit widths! — asmvolatile, Jan 11 '17 at 21:34
more readable version of the edited modExp is [HERE](http://pastebin.com/1WueyNpN). — asmvolatile, Jan 11 '17 at 21:42
Again, I disagree (not trying to argue, just want to make sure that I understand!). I have verified that It works for even exponents as well, so that is not the case. In fact, your code as written did not work for me when I translated it to C for an even exponent, until I changed the last line from `Result:=MontMult(RR,1,N)` to `Result:=MontMult(R,1,N)`. Otherwise it missed the last montMult by the modulus. But mine as it was before worked fine for **both even and odd** exponents as long as they were zero padded. This is why I still think that it's a conceptual issue, rather than a code bug. — asmvolatile, Jan 11 '17 at 22:52
Sorry, I misspoke above. I meant to say that your modExp didn't work when I translated it because **the last residue squaring operation is never stored back into RR**, which is what eventually needs to be montMultiplied by 1 to get the final answer. But mine, as it was before, worked fine for **both even and odd** exponents as long as they were zero padded. This is why I still think that it's a conceptual issue, rather than a code bug. — asmvolatile, Jan 11 '17 at 23:04

score 1 · Accepted Answer · answered Jan 15 '17 at 18:19

Update: I finally was able to fix the issue, after I ported my design to Java to check my intermediate values in the debugger. The design ran flawlessly in Java with no modifications to the code structure, and this tipped me off as to what was going wrong.

The problem came to light after getting correct intermediate values using the BigInteger java package. The HLS arbitrary precision library has a fixed bitwidth (obviously, since it synthesizes down to hardware), whereas the software BigInteger libraries are flexible bit widths. It turns out that the addition operator treats both arguments as signed values if they are different bit-widths, despite the fact that I declared them as unsigned. Thus, when there was a 1 in the MSB of an intermediate value and I tried to add it to a greater value, it treated the MSB as a sign bit and attempted to sign extend it.

This did not happen with the Java BigInt library, which quickly pointed me towards the problem.

If anyone is interested in a Java implementation of modular exponentiation using the Tenca-Koc radix2 algorithm for montgomery multiplication, you can find the code here: https://github.com/bigbrett/MontModExp-radix2

RSA hardware implementation: radix-2 montgomery multiplication issues

The Problem

My Implementation

2 Answers2