Reducing LUT utilization in a Vivado HLS design (RSA cryptosystem using montgomery multiplication)

Question

A question/problem for anyone experienced with Xilinx Vivado HLS and FPGA design:

I need help reducing the utilization numbers of a design within the confines of HLS (i.e. can't just redo the design in an HDL). I am targeting the Zedboard (Zynq 7020).

I'm trying to implement 2048-bit RSA in HLS, using the Tenca-koc multiple-word radix 2 montgomery multiplication algorithm, shown below (More algorithm details here):

I wrote this algorithm in HLS and it works in simulation and in C/RTL cosim. My algorithm is here:

#define MWR2MM_m 2048  // Bit-length of operands
#define MWR2MM_w 8     // word size
#define MWR2MM_e 257   // number of words per operand

// Type definitions
typedef ap_uint<1> bit_t;             // 1-bit scan
typedef ap_uint< MWR2MM_w > word_t;     // 8-bit words
typedef ap_uint< MWR2MM_m > rsaSize_t;  // m-bit operand size


/*
 * Multiple-word radix 2 montgomery multiplication using carry-propagate adder
 */
void mwr2mm_cpa(rsaSize_t X, rsaSize_t Yin, rsaSize_t Min, rsaSize_t* out)
{
    // extend operands to 2 extra words of 0
    ap_uint<MWR2MM_m + 2*MWR2MM_w> Y = Yin; 
    ap_uint<MWR2MM_m + 2*MWR2MM_w> M = Min;
    ap_uint<MWR2MM_m + 2*MWR2MM_w> S = 0;

    ap_uint<2> C = 0; // two carry bits
    bit_t qi = 0;     // an intermediate result bit

    // Store concatenations in a temporary variable to eliminate HLS compiler warnings about shift count
    ap_uint<MWR2MM_w> temp_concat=0; 

    //  scan X bit-by bit
    for (int i=0; i<MWR2MM_m; i++)
    {
        qi = (X[i]*Y[0]) xor S[0];

        // C gets top two bits of temp_concat, j'th word of S gets bottom 8 bits of temp_concat
        temp_concat = X[i]*Y.range(MWR2MM_w-1,0) + qi*M.range(MWR2MM_w-1,0) + S.range(MWR2MM_w-1,0);
        C = temp_concat.range(9,8);
        S.range(MWR2MM_w-1,0) = temp_concat.range(7,0);

        // scan Y and M word-by word, for each bit of X
        for (int j=1; j<=MWR2MM_e; j++)
        {
            temp_concat = C + X[i]*Y.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) + qi*M.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) + S.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j);
            C = temp_concat.range(9,8);
            S.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j) = temp_concat.range(7,0);

            S.range(MWR2MM_w*(j-1)+(MWR2MM_w-1), MWR2MM_w*(j-1)) = (S.bit(MWR2MM_w*j), S.range( MWR2MM_w*(j-1)+(MWR2MM_w-1), MWR2MM_w*(j-1)+1));
        }
        S.range(S.length()-1, S.length()-MWR2MM_w) = 0;
        C=0;
    }

    // if final partial sum is greater than the modulus, bring it back to proper range
    if (S >= M)
        S -= M;

    *out = S;
}

Unfortunately, the LUT utilization is huge.

This is problematic because I need to be able to fit multiple of these blocks in hardware as axi4-lite slaves.

Could someone please provide a few suggestions as to how I can reduce the LUT utilization, WITHIN THE CONFINES OF HLS?

I've already tried the following:

Experimenting with different word lengths
switching the top level inputs to arrays so they are BRAM (i.e. not using ap_uint<2048>, but instead ap_uint foo[MWR2MM_e])
Experimenting with all sorts of directives: compartmentalizing into multiple inline functions, dataflow architecture, resource limits on lshr, etc.

However, nothing really drives the LUT utilization down in a meaningful way. Is there a glaringly obvious way that I could reduce the utilization that is apparent to anyone?

In particular, I've seen papers on implementations of the mwr2mm algorithm that (only use one DSP block and one BRAM). Is this even worth attempting to implement using HLS? Or is there no way that I can actually control the resources that the algorithm is mapped to without describing it in HDL?

Thanks for the help.

Using HLS is not a resource efficient design methodology. Moreover, why are you reinventing the wheel? There are so many RSA implementations on the web ... — Paebbels, Aug 01 '17 at 21:28
@Paebbels We want to develop the RSA block in HLS to do just that! The company wants to evaluate HLS as a resource for speeding up the design flow, and this is a problem we have been working on to evaluate it. I know HLS it isn't the most efficient, but in this case I think that I must be doing something glaringly inefficient with that kind of utilization result. I think my inexperience with the toolsuite is what is causing the current hang-up, not the relative inefficiency of HLS to HDL. I just want to get down to ~30-40 LUT utilization and we will be happy — asmvolatile, Aug 01 '17 at 21:41
Plus, this seems like exactly the use case for HLS right? Translating a high-level algorithmic description directly into hardware... — asmvolatile, Aug 01 '17 at 21:43
This is the trade off of using high level descriptive languages like OpenCL or HLS. A FPGA RTL developer thinks about size, efficiency and meeting timing when the RTL is created WITHOUT shielding the developer from the complexities of RTL. Alternatively, the high level descriptive language is meant to shield the developer from complexity, and meet timing, but it is not meant to be efficient or small. You might find that there is a better way to write the high level description, to get better results, but I think it's more of an artifact of the methodology. — Rich Maes, Aug 01 '17 at 22:04
@RichMaes My thoughts exactly. That said, I still think that someone more experienced with the tools than me could at least point me in the right direction to help shrink this design down to at least a more tolerable range. — asmvolatile, Aug 02 '17 at 12:42
Apparently, you already understood that working at the bit-level for all your operands is an issue. You thus decided to split them in bytes (`MWR2MM_w`) and somehow process them at the byte level. I suspect that your problem is that you did not push this wise decision to its limit. Your `X` operand, for instance, is still parsed one bit at a time. Your HLS tool is not smart enough to handle this and ends up with a huge number of shift registers (`lshr` and `shl`). Try to help it a bit more. — Renaud Pacalet, Aug 04 '17 at 13:48
In order to force your HLS to used RAMs instead of registers, you should probably use arrays of bytes (`Y[j]`) instead of ranges of your 2048-bits words `Y.range(MWR2MM_w*j+(MWR2MM_w-1), MWR2MM_w*j)`). As far as I know these tools they are not really capable of inferring RAMs from anything else than arrays. — Renaud Pacalet, Aug 04 '17 at 13:53
@RenaudPacalet Thanks, missed this response. I actually ended up doing something just like that. Utilization is still not great, however it is much more manageable. Now I'm just trying to figure out how to do the final subtraction of the modulus from the partial sum, given that both of them are now stored in BRAM — asmvolatile, Aug 08 '17 at 13:23
@Brett *Now I'm just trying to figure out how to do the final subtraction*: Don't. Instead, read [this paper](https://comodojapan.com/research/crypto/CDW_ELL_99.ps) by Colin D. Walter. — Renaud Pacalet, Aug 08 '17 at 13:27
@RenaudPacalet I took your advice, and I have a follow up question for you posted here: https://stackoverflow.com/questions/45620060/final-subtraction-in-montgomery-modular-multiplication-for-an-rsa-cryptosystem. Thanks! — asmvolatile, Aug 10 '17 at 17:49

Reducing LUT utilization in a Vivado HLS design (RSA cryptosystem using montgomery multiplication)

0 Answers0